General preface


The  aim  of  the  publication  of  this  series  of  monographs,  known  under  the 
collective title of 'Frontiers of Biology', is to present coherent and up-to-date
views  of  the  fundamental  concepts  which  dominate  modern  biology. 

Biology  in  its  widest  sense  has  made  very  great  advances  during  the  past 
decade,  and  the  rate  of  progress  has  been  steadily  accelerating.  Undoubtedly 
important  factors  in  this  acceleration  have  been  the  effective  use  by  biologists 
of  new  techniques,  including  electron  microscopy,  isotopic  labels,  and  a 
great  variety  of  physical  and  chemical  techniques,  especially  those  with 
varying  degrees  of  automation.  In  addition,  scientists  with  partly  physical  or 
chemical backgrounds have become interested in the great variety of problems
presented by living organisms. Most significant, however, increasing
interest  in  and  understanding  of  the  biology  of  the  cell,  especially  in  regard 
to  the  molecular  events  involved  in  genetic  phenomena  and  in  metabolism 
and  its  control,  have  led  to  the  recognition  of  patterns  common  to  all  forms 
of  life  from  bacteria  to  man.  These  factors  and  unifying  concepts  have  led 
to  a situation  in  which  the  sharp  boundaries  between  the  various  classical 
biological  disciplines  are  rapidly  disappearing. 

Thus,  while  scientists  are  becoming  increasingly  specialized  in  their 
techniques,  to  an  increasing  extent  they  need  an  intellectual  and  conceptual 
approach  on  a wide  and  non-specialized  basis.  It  is  with  these  considerations 
and needs in mind that this series of monographs, 'Frontiers of Biology', has
been  conceived. 

The  advances  in  various  areas  of  biology,  including  microbiology, 
biochemistry,  genetics,  cytology,  and  cell  structure  and  function  in  general 
will  be  presented  by  authors  who  have  themselves  contributed  significantly 
to  these  developments.  They  will  have,  in  this  series,  the  opportunity  of 
bringing  together,  from  diverse  sources,  theories  and  experimental  data, 
and  of  integrating  these  into  a more  general  conceptual  framework.  It  is 
unavoidable, and probably even desirable, that the special bias of the individual
authors will become evident in their contributions. Scope will also be
given  for  presentation  of  new  and  challenging  ideas  and  hypotheses  for 
which  complete  evidence  is  at  present  lacking.  However,  the  main  emphasis 
will  be  on  fairly  complete  and  objective  presentation  of  the  more  important 
and  more  rapidly  advancing  aspects  of  biology.  The  level  will  be  advanced, 
directed primarily to the needs of graduate students and research
workers.

Most  monographs  in  this  series  will  be  in  the  range  of  200-300  pages, 
but  on  occasion  a collective  work  of  major  importance  may  be  included 
somewhat  exceeding  this  figure.  The  intent  of  the  publishers  is  to  bring  out 
these  books  promptly  and  in  fairly  quick  succession. 

It  is  on  the  basis  of  all  these  various  considerations  that  we  welcome  the 
opportunity  of  supporting  the  publication  of  the  series  'Frontiers  of  Biology' 
by  North-Holland  Publishing  Company. 

E.  L.  Tatum 
A.  Neuberger,  Editors 
Foreword


The study of evolution, like so much of biology, has been suddenly enriched
by the eruption and rapid diffusion of molecular knowledge - knowledge
with a generality, depth, precision, and satisfying simplicity almost
unique  in  the  biological  sciences. 

The most basic process in evolution is the change in frequency of individual
genes and the emergence of novel types by mutation and duplication.
Yet,  evolutionists  have  had  to  be  content  with  inferences  about  these 
processes  based  on  observation  of  phenotypes,  inferences  that  have  usually 
been  indirect  and  uncertain.  Molecular  genetics  is  rapidly  remedying  this  by 
providing  an  ever-increasing  battery  of  techniques  for  the  direct  assay  of 
genotypes.  Moreover,  the  traditional  limitation  of  classical  genetics  - the 
inability  to  perform  breeding  experiments  between  species  that  cannot  be 
hybridized  - has  been  removed.  Gene  comparisons  between  monkeys  and 
humans,  between  vertebrates  and  invertebrates,  between  animals  and  plants, 
and  even  between  eukaryotes  and  prokaryotes  are  now  routine,  thanks  to  a 
molecular  methodology  that  bypasses  Mendelian  analysis.  Furthermore,  the 
time  scale  of  genetic  analysis  has  been  totally  changed.  We  can  now  make 
reliable  inferences  about  the  genes  responsible  for  histone  and  transfer  RNA 
in  our  ancestors  2 ~ 3 billion  years  ago. 

Population  genetics  and  intra-species  evolution  has  a mathematical  theory 
that  in  comparison  with  that  in  most  biology  is  rich  indeed.  Yet  it  is  a 
frequent  criticism  that  experimental  study  has  not  been  closely  tied  to  the 
theory. One reason for this is that some of the best of the mathematics
developed  by  the  founding  trio,  Wright,  Fisher,  and  Haldane  - particularly 
the  stochastic  theory  - is  most  appropriate  to  individual  genes  observed  for 
long  time  periods,  and  suitable  data  have  been  hard  to  obtain.  This  is 
equally  true  for  Malecot’s  elegant  treatment  of  geographical  structure,  built 
on  the  concept  of  gene  identity  and  its  decrease  with  distance.  Molecular 
studies have not only increased the relevance of existing theory, but have
stimulated  new  developments,  particularly  with  regard  to  the  stochastic  fate 
of  individual  mutants,  an  area  in  which  the  name  of  Kimura  stands  out. 

Of  course,  evolutionary  biology  is  not  concerned  solely  with  changes  of 
the individual gene or nucleotide. Biologists are also interested in the evolution
of form and function, in whole organisms and populations of whole
organisms.  It  is  a truism  that  natural  selection  acts  on  phenotypes,  not  on 
individual  genes.  Many  evolutionists  are  properly  concerned  with  the 
evolution  of  such  interesting  and  complex  hypertrophies  as  the  elephant 
snout  and  the  human  forebrain,  more  than  with  the  causative  DNA.  There 
are  also  problems  of  chromosome  organization,  of  the  role  of  linkage  and 
recombination,  of  the  evolution  of  quantitative  traits  and  of  fitness  itself,  of 
the  different  forms  of  reproduction,  of  geographical  structure,  of  adaptation 
to  different  habitats,  and  a host  of  others.  Their  investigation  can  proceed 
with  a firmer  understanding  of  the  underlying  molecular  phenomena. 

The  emphasis  in  this  book  is  on  those  aspects  of  evolution  that  are  revealed 
by  molecular  methodology.  There  is  a pressing  need  to  summarize  and 
organize  the  bewildering  collection  of  facts  that  have  been  discovered  in  the 
past  few  years,  and  to  relate  these  to  the  theory,  classical  and  new,  that  can 
provide  understanding  and  coherence.  It  is  appropriate  that  such  a book  be 
written  by  one  who  is  himself  a leader  in  developing  and  applying  the  theory. 
Dr. Nei has given a complete and lucid summary of the relevant theory along
with  an  abundance  of  data  from  widely  diverse  sources.  It  is  appropriate, 
even  essential,  that  a book  in  a rapidly  moving  field  be  up  to  date.  This 
one  is;  in  fact  the  author's  wide  acquaintance  has  permitted  the  inclusion  of 
considerable  material  not  yet  published. 

This  book  will  be  especially  useful  to  those,  both  in  the  field  and  outside 
it,  who  are  trying  to  keep  abreast  of  recent  developments.  They  will  discover 
that  molecular  biology,  while  providing  unexpected  solutions  to  old  problems, 
has  raised  some  equally  unexpected  new  ones. 


JAMES  F.  CROW 
In the last decade the progress of molecular biology has had a strong
influence on the theoretical framework of population genetics and evolution.
Introduction  of  molecular  techniques  in  this  area  has  resulted  in  many  new 
discoveries.  As  a result,  a new  interdisciplinary  science,  which  may  be  called 
'Molecular  Population  Genetics  and  Evolution',  has  emerged.  In  this  book 
I have  attempted  to  discuss  the  development  and  outline  of  this  science. 

In  recent  years  a large  number  of  papers  have  been  published  on  this 
subject.  In  this  book  I have  not  particularly  attempted  to  cover  all  these 
papers.  Rather,  I have  tried  to  find  the  general  principles  behind  the  new 
observations  and  theoretical  (mathematical)  studies.  I have  also  tried  to 
understand  this  subject  in  the  background  of  classical  population  genetics 
and  evolution. 

In  the  development  of  molecular  population  genetics  and  evolution  the 
interplay between observation and theory was very important. I have therefore
discussed both experimental and theoretical studies. Chapters 4 and 5
are  devoted  mostly  to  the  mathematical  theory  of  population  genetics,  while 
in  the  other  chapters  empirical  data  are  discussed  in  the  light  of  theory.  It 
should  be  noted  that  the  genetic  change  of  population  is  affected  by  so  many 
factors,  that  it  is  difficult  to  understand  the  whole  process  of  evolutionary 
change without the aid of mathematical models. On the other hand, mathematical
studies are always abstract and depend on some simplifying assumptions,
the validity of which must be tested by empirical data.

The  mathematics  used  in  this  book  is  not  very  sophisticated.  The  reader 
who  has  a knowledge  of  calculus  and  probability  theory  should  be  able  to 
understand  the  whole  book.  In  some  sections  of  chapter  5,  however,  I have 
given  only  the  mathematical  framework  of  the  model  used  and  the  final 
formulae.  The  reader  who  is  interested  in  the  derivation  may  refer  to  the 
original  papers  cited.  Whenever  there  are  several  alternative  methods 
available to derive a formula, I have used the simplest one, though it may not
be mathematically rigorous. I have included only those theories that are
directly  related  to  our  subject  and  applicable  for  data  analysis  or  theoretical 
inference. 

This  book  has  grown  out  of  a course  for  graduate  students  given  at  Brown 
University  in  1971.  Parts  of  this  book  were  also  presented  in  a course  at  the 
University of Texas at Houston. The students attending these courses were
heterogeneous  and  came  from  both  biology  and  applied  mathematics 
departments. In these courses I made an effort to make this subject understandable
to both biologists and applied mathematicians. I hope that this
effort has remained in this book. The reader who does not care for mathematical
details may skip chapters 4 and 5. Most of the biologically important
subjects  are  discussed  in  chapters  2,  3,  6,  7,  and  8 without  using  advanced 
mathematics. 

I would  like  to  take  this  opportunity  to  express  my  indebtedness  to 
Motoo Kimura, whose writing and advice not only introduced me to the
field of population genetics but also guided my work on this subject.
Moreover, he was kind enough to read the first draft of this manuscript and
to make valuable comments. My thanks also go to Ranajit Chakraborty, James
Crow,  Daniel  Hartl,  Donald  Levin,  Wen-Hsiung  Li,  Takeo  Maruyama, 
Robert  Selander,  Yoshio  Tateno,  Martin  Tracey,  and  Kenneth  Weiss  for 
reading  the  whole  or  various  parts  of  the  manuscript  and  making  valuable 
comments.  I am  indebted  to  Arun  Roychoudhury  and  Yoshio  Tateno  for 
their  help  in  data  analysis.  Special  gratitude  is  expressed  to  Mrs.  Kathleen 
Ward  who,  with  untiring  effort,  typed  all  the  manuscript  and  checked  the 
references. 

Unpublished  works  included  in  this  book  were  supported  by  U.S.  Public 
Health  Service  Grant  GM  20293. 
Introduction


Any  species  of  organism  in  nature  lives  in  a form  of  population.  A population 
of  organisms  is  characterized  by  some  sort  of  cooperative  or  inhibitory 
interaction  between  members  of  the  population.  Thus,  the  rate  of  growth  of  a 
population  depends  on  the  population  size  or  density  in  addition  to  the 
physical  environment  in  which  the  population  is  placed.  When  population 
density  is  below  a certain  level,  the  members  of  the  population  often  interact 
cooperatively,  while  in  a high  density  they  interact  inhibitorily.  In  organisms 
with  separate  sexes,  mating  between  males  and  females  is  essential  for  the 
survival  of  a population.  Interactions  between  individuals  are  not  confined 
within  a single  species  but  also  occur  between  different  species.  The  survival 
of  a species  generally  depends  on  the  existence  of  many  other  species  which 
serve  as  food,  mediator  of  mating,  shelter  from  physical  and  biological 
hazards,  etc. 

A population  of  organisms  has  properties  or  characteristics  that  transcend 
the  characteristics  of  an  individual.  The  growth  of  a population  is  certainly 
different  from  that  of  an  individual.  The  differences  between  ethnic  groups 
of  man  can  be  described  only  by  the  distributions  of  certain  quantitative 
characters  or  by  the  frequencies  of  certain  identifiable  genes.  All  these 
measurements  are  characteristics  of  populations  rather  than  of  individuals. 

Population genetics aims to study the genetic structure of populations
and the laws by which the genetic structure changes. By genetic structure we
mean the types and frequencies of genes or genotypes present in the population.
Natural populations are often composed of many subpopulations or of
individuals  which  are  distributed  more  or  less  uniformly  in  an  area.  In  this 
case  the  genetic  structure  of  populations  must  be  described  by  taking  into 
account  the  geographical  distribution  of  gene  or  genotype  frequencies.  The 
genetic  structure  of  a population  is  determined  by  a large  number  of  loci. 
At  the  present  time,  however,  only  a small  proportion  of  the  genes  present 
in higher organisms have been identified. Therefore, our knowledge of the
genetic  structure  of  a population  is  far  from  complete.  Nevertheless,  it  is 
important  and  meaningful  to  know  the  frequencies  of  genes  or  genotypes 
with  respect  to  a certain  biologically  important  locus  or  a group  of  loci.  For 
example,  sickle  cell  anemia  in  man  is  controlled  by  a single  locus,  and  the 
frequency  changes  of  this  disease  in  populations  can  be  studied  without 
regard  to  other  gene  loci. 

Evolution  is  a process  of  successive  transformation  of  the  genetic  structure 
of populations. Therefore, the theory of population genetics plays an important
role in the study of mechanisms of evolution. The basic factors for
evolution are mutation, gene duplication, natural selection, and random genetic
drift.  In  adaptive  evolution  recombination  of  genes  is  also  important  in 
speeding  up  the  evolution.  However,  the  manner  in  which  these  factors 
interact  with  each  other  in  building  up  various  novel  morphological  and 
physiological  characters  is  not  well  understood.  For  example,  sexual 
reproduction  is  widespread  among  the  present  organisms,  but  the  very 
initial  step  of  the  evolution  of  sexual  reproduction  is  virtually  unknown.  The 
evolutionary  mechanisms  of  repeated  DNA  in  higher  organisms  or  F-factor, 
lysogenesis,  etc.  in  bacteria  are  also  mysterious.  In  the  study  of  evolution  it 
is  important  to  know  the  detailed  evolutionary  pathways  or  phylogenies  of 
different  organisms  with  reasonable  estimates  of  evolutionary  time.  The 
eventual  goal  of  the  study  of  evolution  is  to  understand  all  the  processes  of 
evolution quantitatively and be able to predict and control the future evolution
of organisms. At the present time our understanding of evolutionary
processes  is  far  from  this  goal,  but  substantial  progress  has  been  made  in 
recent  years. 

Any  theory  in  natural  science  is  established  through  a two-step  procedure, 
i.e.  making  a hypothesis  and  testing  the  hypothesis  by  observations  or 
experiments.  A direct  test  of  a hypothesis  in  evolutionary  studies  is  often 
difficult  because  evolution  is  generally  a slow  process  compared  with  our 
lifetime.  However,  there  are  indirect  ways  of  testing  a hypothesis.  In  some 
cases it is sufficient to examine the data obtained in paleontology, biogeography,
comparative biochemistry, etc. In some other cases a mathematical
method is used to make deductions from a hypothesis and then the deductions
are compared with the existing data from paleontology, population
biology,  etc. 

Until  recently  population  genetics  was  concerned  mainly  with  rather 
short-term  changes  of  genetic  structure  of  populations.  This  is  because  our 
lifetime  is  very  short  compared  with  evolutionary  time.  The  process  of 
long-term evolution was simply conjectured as a continuation of short-term
changes. There was no way to trace the genetic change of a population or the
evolutionary change of a gene through long-term evolution. The development
of  molecular  biology  in  the  last  two  decades  has  changed  this  situation 
drastically.  Now  the  evolutionary  change  of  at  least  some  genes  can  be 
traced  in  considerable  detail  by  studying  the  genetic  material  DNA  or  its 
direct  products  RNA  and  proteins  in  different  species.  This  has  enabled 
population  geneticists  to  evaluate  the  evolutionary  changes  of  populations 
more  quantitatively  and  to  test  the  validity  of  previous  conjectures  about 
long-term  evolution  or  the  stability  of  genetic  systems. 

Previously, whenever a new genetic polymorphism was discovered, population
geneticists were tempted to explain it in terms of overdominance or some
other  kind  of  balancing  selection.  This  was  natural  because  they  were  not 
acquainted  with  how  genes  really  changed  in  the  evolutionary  process. 
Recent  studies  on  DNA,  RNA,  or  protein  structures  indicate  that  genes 
have  almost  always  been  changing,  though  the  rate  of  change  is  very  slow. 
It  is  now  clear  that  the  genetic  structure  of  a population  never  stays  constant. 
A large  part  of  this  change  is  apparently  due  to  the  constantly  changing 
environment.  In  addition  to  the  geological  and  meteorological  change  of 
environment,  such  as  continental  drift  and  glaciation,  the  environment  of  a 
species  is  also  altered  by  biological  factors  such  as  emergence  of  new  species 
and  imbalance  of  food  chains.  In  fact,  the  biological  world  or  the  whole 
ecosystem  of  organisms  is  in  a state  of  never-ending  transformation.  Yet, 
an  equally  large  or  even  larger  part  of  the  change  of  genetic  structure  of 
populations  now  appears  to  be  of  random  nature  and  largely  irrelevant  to 
the  adaptation  of  organisms. 

Molecular  biology  has  also  changed  another  important  concept  in 
classical  population  genetics.  In  population  genetics  it  was  customary  to 
assume  that  there  are  only  a small  number  of  possible  allelic  states  at  a locus 
and  mutation  occurs  recurrently  forwards  and  backwards  between  these 
allelic  states  or  alleles.  At  the  molecular  level,  however,  a gene  or  cistron 
consists  of  about  1000  nucleotide  pairs.  Since  there  are  four  different  kinds 
of  nucleotides,  i.e.,  adenine,  thymine,  guanine,  and  cytosine,  the  number  of 
possible allelic states is 4^1000 or 10^602 (Wright, 1966). In practice, a substantial
part of these states would never be attained because the functional
requirement of the gene product prohibits certain mutational changes. However,
even a single nucleotide replacement in a cistron of 1000 nucleotide
pairs  can  produce  3000  different  kinds  of  alleles.  The  actual  number  of 
possible  allelic  states  must  be  much  larger  than  this.  Since  the  number  of 
alleles existing in any population is quite limited, this indicates that a new
mutation is almost always different from the alleles preexisting in the
population (Kimura and Crow, 1964). This change in the concept of mutation
has led a number of authors, notably Kimura (1971), to formulate a new
theory  of  population  genetics  at  the  molecular  level.  It  has  also  transformed 
some  of  the  old  theories  in  population  genetics.  For  example,  Wright's 
theory  of  inbreeding,  based  on  the  'fixed  allele  model',  can  now  be  regarded 
as  a special  case  of  a broader  theory  based  on  the  'variable  allele  model' 
(see  Nei,  1973a).  In  this  model  the  identity  of  genes  by  state  is  identical  to  the 
identity  of  genes  by  descent. 
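
The counting argument in this paragraph can be checked with a few lines of arithmetic. The following Python sketch is only an illustration under the assumptions stated in the text (a cistron of 1000 nucleotide pairs and four kinds of nucleotides); the function and constant names are chosen here for convenience and do not come from the original.

```python
import math

NUCLEOTIDES = 4        # adenine, thymine, guanine, cytosine
CISTRON_LENGTH = 1000  # nucleotide pairs per cistron, as assumed in the text


def log10_allelic_states(length, alphabet=NUCLEOTIDES):
    """Return log10 of alphabet**length, the number of possible sequences."""
    return length * math.log10(alphabet)


def single_replacement_alleles(length, alphabet=NUCLEOTIDES):
    """Count the sequences that differ from a given one at exactly one site."""
    return length * (alphabet - 1)


exponent = log10_allelic_states(CISTRON_LENGTH)
print(f"4^{CISTRON_LENGTH} is about 10^{exponent:.0f}")  # 4^1000 is about 10^602
print(single_replacement_alleles(CISTRON_LENGTH))        # 3000
```

Each of the 1000 sites can be replaced by any of the three other nucleotides, which gives the 3000 single-step alleles mentioned above. Working with the logarithm avoids printing a number of more than 600 digits.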

The  crux  of  the  Darwinian  or  neo-Darwinian  theory  of  evolution  is 
natural  selection  of  the  fittest  individuals  in  the  population.  In  the  first  half 
of this century, primarily by the efforts of prominent geneticists and evolutionists
such as Fisher (1930), Haldane (1932), Wright (1932), Dobzhansky
(1951),  Simpson  (1953),  and  Mayr  (1963),  a sophisticated  theory  of  evolution 
by  natural  selection  was  constructed.  In  this  theory  mutation  plays  a rather 
minor role. Modifying King's (1972) summaries, the classical view of neo-Darwinism
can be stated as follows:

1)  There  is  always  sufficient  genetic  variability  present  in  any  natural 
population  to  respond  to  any  selection  pressure.  Mutation  rates  are  always 
in  excess  of  the  evolutionary  needs  of  the  species. 

2)  Mutation  is  random  with  respect  to  function. 

3)  Evolution  is  almost  entirely  determined  by  environmental  changes  and 
natural  selection.  Since  there  is  enough  genetic  variability,  no  new  mutations 
are  required  for  a population  to  evolve  in  response  to  an  environmental 
change.  There  is  no  relationship  between  the  rate  of  mutation  and  the  rate 
of  evolutionary  change. 

4)  Because  mutations  tend  to  recur  at  reasonably  high  rates,  any  clearly 
adaptive  mutation  is  certain  to  have  already  been  fixed  or  reached  its 
optimum  frequency  in  the  population.  Namely,  the  genetic  structure  of  a 
natural  population  is  always  at  or  near  its  optimum  with  respect  to  the 
'adaptive  surface'  in  a given  environment  (Wright,  1932). 

5)  Since  the  genetic  structure  of  a population  is  at  its  optimum,  and  since 
neutral  mutations  are  unknown,  virtually  all  new  mutations  are  deleterious, 
unless  the  environment  has  changed  very  recently. 

Some of the above statements seem to be still true at the level of morphological
and physiological evolution. Natural selection plays an important
role  in  adaptive  evolution.  However,  most  of  the  above  statements  do  not 
appear  to  be  warranted  at  the  level  of  molecular  evolution.  Questioning  of 
the above statements has led Kimura (1968a) and King and Jukes (1969)
to postulate the neutral-mutation-random-drift theory of evolution. According
to this theory, a majority of evolutionary changes of macromolecules
are the result of random fixation of selectively neutral mutations. On the
other hand, Ohno (1970) postulated that natural selection is nothing but a
mechanism  to  preserve  the  established  function  of  a gene  and  evolution 
occurs  mainly  by  duplicate  genes  acquiring  new  functions.  These  views  have 
not  yet  been  widely  accepted  by  biologists,  but  at  least  at  the  molecular 
level  they  are  consistent  with  available  data.  Furthermore,  as  I shall  indicate 
later, mutation seems to be more important than neo-Darwinian evolutionists
have thought even in adaptive evolution.

Evolution  can  be  divided  into  two  phases,  i.e.,  chemical  and  organic 
evolution.  The  former  is  concerned  with  the  origin  of  life,  and  active  studies 
are  being  conducted  about  the  physical  and  chemical  conditions  under 
which  a life  or  self-perpetuating  substance  can  arise.  In  this  book,  however, 
we  shall  not  discuss  this  area.  We  will  be  mostly  concerned  with  organic 
evolution,  particularly  the  evolution  of  higher  organisms.  The  reader  who  is 
interested  in  chemical  evolution  may  refer  to  the  monographs  'Chemical 
Evolution'  by  Calvin  (1969)  and  'Molecular  Evolution  and  the  Origin  of 
Life'  by  Fox  and  Dose  (1972). 
CHAPTER 2


Evolutionary  history  of  life 


In this chapter I would like to give a brief history of life just to outline
the time scale of evolution. Since all present organisms are evolutionary
products, knowledge of evolution is important in any study of genetic
change in populations.


2.1  Evidence  from  paleontology  and  comparative 
morphology 

At  the  present  time  it  is  believed  that  the  earth  was  formed  about  4.5  billion 
years  ago.  It  is  not  known  exactly  when  the  first  life  or  self-replicating 
substance  was  formed.  Until  very  recently  the  fossils  from  the  early  geological 
time,  i.e.  the  Precambrian  era  (more  than  600  million  years  ago),  were 
almost  nonexistent.  The  recent  development  of  isotopic  methods  of  dating 
rocks, however, initiated an intensive study of early fossils. In 1966 Barghoorn
and Schopf discovered bacteria-like fossils in the Fig Tree Chert,
a very old rock from South Africa, which was dated about 3.1 billion years
old. They are the oldest fossils ever discovered on the earth. This organism
was named Eobacterium isolatum. This discovery suggests that life originated
more  than  3 billion  years  ago. 

The  second  oldest  microfossils  we  now  know  are  those  of  filamentous 
blue-green  algae  found  in  a dolomitic  limestone  stromatolite  in  South 
Africa as old as 2.2 billion years (Nagy, 1974). There are many other Precambrian
fossils, but most of them are the fossils of microorganisms (cf.
Calvin,  1969).  The  oldest  fossil  of  nucleated  eukaryotic  cells  was  discovered 
by  Cloud  et  al.  (1969).  This  has  been  dated  1.2  ~ 1.4  billion  years  old. 

Fig.  2.1  is  a representation  of  the  geological  time  scale,  giving  a rough 
idea  of  chemical  and  organic  evolution.  There  are  rather  extensive  fossil 
records in the Cambrian and Postcambrian periods, and the major evolutionary
processes in these geological periods can be reconstructed from
these  fossils.  The  fossils  in  the  early  Cambrian  period  show  that  most  living 
phyla  in  plants  and  animals  were  present  at  that  time.  This  indicates  that 
they  were  differentiated  before  the  Cambrian  period.  Despite  the  recent 
progress  in  the  paleontology  of  the  Precambrian  period,  the  fossil  records 
in  this  period  are  still  very  few  and  permit  no  detailed  study  of  evolution. 
Therefore,  evolution  in  the  Precambrian  period  can  only  be  inferred  from 
the  morphological,  embryological,  and  biochemical  studies.  Before  the 
development  of  molecular  biology,  morphological  and  embryological  studies 
were  very  useful  for  elucidating  the  phylogenetic  relationships  of  different 
organisms. Using this method of comparative morphology and paleontological
data, the classical evolutionists were able to construct reasonably
good  phylogenetic  trees  of  different  groups  (orders)  of  plants  and  animals 
in  the  Cambrian  and  Postcambrian  periods.  These  phylogenetic  trees  are 
treated  in  many  classical  textbooks  of  evolution  (e.g.  Simpson,  1949),  so 
that  we  need  not  repeat  them  here.  For  our  present  purpose,  it  would  suffice 
to  give  an  abbreviated  tree  with  emphasis  on  vertebrate  animals  as  given  in 
fig.  2.2. 

2.2  Evidence  from  molecular  biology 

As  mentioned  above,  the  method  of  comparative  morphology  was  very 
useful  in  evolutionary  studies  when  fossil  records  were  lacking.  However, 
this  method  could  not  give  the  time  scale  of  evolution.  The  brilliant  progress 
of  molecular  biology  in  the  last  two  decades  has  provided  a new  method  for 
the  study  of  evolution.  The  basis  of  this  powerful  method  is  the  high  degree  of 
stability of nucleotide sequences in DNA (RNA in some viruses). The evolutionary
changes of nucleotide sequences are so slow, that they provide detailed
information  about  their  origin  and  history.  Since  the  nucleotide  sequences 
in  structural  genes  of  DNA  are  translated  into  the  amino  acid  sequences  of 
proteins  through  the  genetic  code,  the  evolutionary  changes  of  amino  acid 
sequences  in  proteins  also  provide  information  about  the  process  and 
approximate  time  scale  of  evolution.  In  fact,  most  of  the  results  obtained 
through  studies  at  the  molecular  level  come  from  analyses  of  amino  acid 
sequences  of  certain  proteins.  The  estimation  of  evolutionary  time  by  this 
method rests on the discovery that the rate of amino acid substitutions per
year per site in a protein is roughly constant for all organisms. Evidence for
this will be examined in detail in ch. 8.

Table 2.1

The 20 amino acids that compose proteins and their three- and one-letter abbreviations.
The abbreviations are in accordance with those of Dayhoff (1969).

                     Abbreviations                                  Abbreviations
Name                 Three-letter  One-letter   Name                Three-letter  One-letter

 1. Alanine          Ala           A            11. Leucine         Leu           L
 2. Arginine         Arg           R            12. Lysine          Lys           K
 3. Asparagine       Asn           N            13. Methionine      Met           M
 4. Aspartic acid    Asp           D            14. Phenylalanine   Phe           F
 5. Cysteine         Cys           C            15. Proline         Pro           P
 6. Glutamine        Gln           Q            16. Serine          Ser           S
 7. Glutamic acid    Glu           E            17. Threonine       Thr           T
 8. Glycine          Gly           G            18. Tryptophan      Trp           W
 9. Histidine        His           H            19. Tyrosine        Tyr           Y
10. Isoleucine       Ile           I            20. Valine          Val           V

There  are  20  different  amino  acids  that  compose  proteins.  The  names  and 
abbreviations  of  the  amino  acids  are  given  in  table  2.1.  The  chemical 
structures  of  these  amino  acids  can  be  found  in  any  textbook  of  biochemistry 
or  molecular  biology.  Some  proteins  are  composed  of  a single  polypeptide, 
a polymer  of  amino  acids  linked  together  by  peptide  bonds,  while  others 
consist  of  several  polypeptides  which  may  or  may  not  be  identical  with  each 
other.  Important  for  the  study  of  evolution  are  the  linear  arrangements  of 
amino  acids  in  these  polypeptides. 

Hemoglobin A in man consists of two α-chain and two β-chain polypeptides.
In fig. 2.3 the amino acid sequence in the α-chain is given together
with those from horse, bovine, and carp. The numbers of amino acid differences
between these α-chains are presented in table 2.2. It is clear that the
differences  between  fish  (carp)  and  mammals  (human,  horse,  and  bovine) 
are  much  larger  than  the  differences  among  mammals.  These  differences 
can  be  related  to  the  evolutionary  time  in  the  following  way. 

As will be discussed in the next section, all organisms on this planet appear
to have originated from a single protoorganism. Therefore, speciation must
have occurred with a high frequency in the evolutionary process. Genetic
differentiation between a pair of species starts to occur as soon as their
primordial populations are reproductively isolated. Let t be the period of
time in which a pair of species have been isolated. Consider a structural
gene which codes for a polypeptide composed of n amino acids. Since an
amino acid is coded for by triplet nucleotides or a codon in DNA, there are
3n nucleotide pairs involved in this gene. Any change of these nucleotide
pairs is a mutation, but it does not necessarily give rise to amino acid
substitution because of degeneracy of the genetic code (see ch. 3).

Fig. 2.3. Amino acid sequences in the α-chains of hemoglobins in four vertebrate species. Amino acids are expressed in terms of one-letter
abbreviations. The hyphens indicate the positions of deletions or additions.

Table 2.2

Numbers of amino acid differences between hemoglobin α-chains from human, horse,
bovine, and carp. Deletions and additions were excluded from computation, so that 140
amino acids were compared. The figures in parentheses are the proportions of different
amino acids. The values given below the diagonal are the estimates of average number of
amino acid substitutions per site between two species (δ).

            Human        Horse        Bovine       Carp
Human                    18 (0.129)   16 (0.114)   68 (0.486)
Horse       0.138                     18 (0.129)   66 (0.471)
Bovine      0.121        0.138                     65 (0.464)
Carp        0.666        0.637        0.624

Let $\lambda$ be the rate (probability) of amino acid substitution per year at a
particular amino acid site and assume that it remains constant for the entire
evolutionary period. This assumption is only roughly correct but does not
affect the final result very much. The mean number of amino acid substitutions
at this site during a period of t years is then $\lambda t$, and the probability of
occurrence of r amino acid substitutions is given by

$$p_r(t) = \frac{e^{-\lambda t}(\lambda t)^r}{r!} \qquad (2.1)$$

This is a simple application of the Poisson process in probability theory
(Nei, 1969a; see Feller (1957) for the derivation). In particular, $p_0(t) = e^{-\lambda t}$,
which was used by Zuckerkandl and Pauling (1965) and Margoliash and
Smith (1965) in predicting the evolutionary change of hemoglobin and
cytochrome c.
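
The Poisson probability described above can be evaluated numerically. The following is a minimal sketch only; the function name and the numerical values of the rate and the time period are hypothetical and are chosen merely for illustration.

```python
import math


def substitution_probability(r, rate, years):
    """Poisson probability of exactly r substitutions at one site, as in eq. (2.1).

    rate  : assumed substitution rate per site per year (lambda in the text)
    years : length of the period t in years
    """
    mean = rate * years  # mean number of substitutions, lambda * t
    return math.exp(-mean) * mean ** r / math.factorial(r)


# Hypothetical illustration: lambda = 1e-9 per site per year, t = 1e8 years.
rate, years = 1e-9, 1e8
probs = [substitution_probability(r, rate, years) for r in range(4)]
for r, p in enumerate(probs):
    print(f"p_{r}(t) = {p:.6f}")
```

With r = 0 the function reduces to the probability of no substitution, $e^{-\lambda t}$, which is the quantity used in the derivation that follows.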

Since the probability that amino acid substitution does not occur at a
particular site during t years is $e^{-\lambda t}$, the probability that neither of the
homologous sites of the two polypeptides from a pair of species undergoes
substitution is $e^{-2\lambda t}$. Therefore, if $\lambda$ is the same for all amino acid sites, the
expected number of identical amino acids ($n_i$) between the two polypeptides is

$$n_i = n e^{-2\lambda t} \qquad (2.2)$$

approximately. This formula is approximate because it does not include
the possibility of either back mutation or parallel mutation (the same amino
acid substitution occurring at the same site of the homologous polypeptides).
But this probability is generally very small (Nei, 1971a). A more serious
error may be introduced by the assumption of constancy of $\lambda$ for all sites,
which is certainly not true. This error is, however, known to be small unless
the variance of $\lambda$ is very large.

At any rate, under the above assumption $\delta = 2\lambda t$ can be estimated by

$$\delta = -\log_e i_a, \qquad (2.3)$$

where $i_a = n_i/n$, while the variance of $\delta$ is

$$V_\delta = \frac{1 - i_a}{i_a n} \qquad (2.4)$$

approximately. If $\delta$ is estimated for two different pairs of species, the relative
evolutionary time (T) of one pair to the other can be obtained. Namely,

$$T = \delta_1/\delta_2, \qquad (2.5)$$

where $\delta_1$ and $\delta_2$ are the values of $\delta$ for the first and the second pairs of
species. Furthermore, if t is known, $\lambda$ may be estimated by $\delta/(2t)$. On the
other hand, if $\lambda$ is known, t may be estimated by $\delta/(2\lambda)$.
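
The estimation procedure above can be sketched in a few lines. This is an illustration only: the function name is hypothetical, and the human-horse count (18 differences among 140 sites) is taken from table 2.2.

```python
import math


def delta_estimate(n_different, n_sites):
    """Estimate substitutions per site between two sequences (eqs. 2.3 and 2.4).

    n_different : number of sites with different amino acids (n - n_i)
    n_sites     : number of sites compared (n)
    Returns the estimate of delta and its approximate variance.
    """
    i_a = 1 - n_different / n_sites          # proportion of identical sites
    delta = -math.log(i_a)                   # eq. (2.3)
    variance = (1 - i_a) / (i_a * n_sites)   # eq. (2.4)
    return delta, variance


# Human versus horse hemoglobin alpha-chains, from table 2.2.
delta, variance = delta_estimate(18, 140)
print(f"delta = {delta:.3f}, standard error = {math.sqrt(variance):.3f}")
```

Small discrepancies in the third decimal place relative to a published table may arise if the proportions are rounded before the logarithm is taken.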

In table 2.2 the estimates of $\delta$ are given for six pairs of species together
with $n - n_i$ and $1 - i_a$. The average value of $\delta$'s for the pairs of mammalian
species is 0.132, while the average for the pairs of carp and mammalian
species is 0.642. Therefore, the relative evolutionary time of fish to that of
mammals is estimated to be 4.9. On the other hand, geological data suggest
that fish evolved 350 to 400 million years ago while the divergence of
mammalian species occurred about 75 to 80 million years ago (fig. 2.2), the
relative evolutionary time of fish to that of mammals being about five
times. Thus, the molecular data agree quite well with the geological data.

Table 2.3

Average numbers of amino acid differences between cytochromes c from different groups
of animals (McLaughlin and Dayhoff, 1970). These are averages of from 1 to 51
comparisons of sequences of about 108 amino acids, including the deletions and additions.
The figures in parentheses are the average numbers of amino acid differences divided by
94 (14 amino acid sites are believed to be 'immutable'). The values of $\delta$ are given below the
diagonal.
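
The arithmetic behind this comparison can be checked as follows. The sketch uses the $\delta$ values printed below the diagonal of table 2.2 and the geological ranges quoted in the text; the variable names are illustrative only.

```python
from statistics import mean

# Estimates of delta read from below the diagonal of table 2.2.
mammal_pairs = {"human-horse": 0.138, "human-bovine": 0.121, "horse-bovine": 0.138}
carp_pairs = {"carp-human": 0.666, "carp-horse": 0.637, "carp-bovine": 0.624}

mean_mammal = mean(mammal_pairs.values())
mean_carp = mean(carp_pairs.values())

# Relative evolutionary time of the fish-mammal split to the mammalian split, eq. (2.5).
relative_time = mean_carp / mean_mammal

# Bounds implied by the geological dates quoted in the text (millions of years).
geological_low = 350 / 80
geological_high = 400 / 75

print(f"mean delta (mammals) = {mean_mammal:.3f}")
print(f"mean delta (carp vs. mammals) = {mean_carp:.3f}")
print(f"relative time T = {relative_time:.1f}")
print(f"geological ratio between {geological_low:.1f} and {geological_high:.1f}")
```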

In table 2.3 the average numbers of amino acid differences between
cytochromes c from animals, plants, fungi, and prokaryotes (bacteria) are
given. The average number of amino acids per sequence used for
comparisons was about 108. Cytochrome c is believed to have about 14
'immutable' sites, at which amino acid substitution destroys the function of the
protein. Excluding these 14 amino acid sites, we can compute the values of
$\delta$ for all pairs of the above groups of organisms. They are presented in
table 2.3. It is clear that animals, plants, and fungi (all are eukaryotes) were
differentiated almost at the same time, while the divergence between
prokaryotes and eukaryotes occurred much earlier. The divergence time
between prokaryotes and eukaryotes is estimated to be about twice as large
as the divergence time among animals, plants, and fungi.

The above estimates of divergence time roughly agree with those obtained
by McLaughlin and Dayhoff (1970) using a different statistical method. They
obtained $\delta_1 = 0.58$ between the animal and plant kingdoms and $\delta_2 = 1.37$
between the prokaryotes and eukaryotes. They also studied the nucleotide
differences of four different transfer RNA's (tRNA's) within and between
prokaryotes and eukaryotes, estimating that the divergence of prokaryotes
and eukaryotes was about 2.6 ($= \delta_2/\delta_1$) times earlier than the divergence
between plants and animals. This value, however, seems to be an
overestimate. Kimura and Ohta (1973a) reanalyzed the same tRNA data and
obtained $\delta_2/\delta_1 = 1.99$. Furthermore, a similar analysis of 5S RNA data by
these authors gave an estimate of $\delta_2/\delta_1 = 1.46$. Therefore, it seems that the
divergence of prokaryotes and eukaryotes was 1.5 to 2 times earlier than the
divergence between plants and animals. As will be seen in ch. 8 (fig. 8.3),
the divergence time between plants and animals has been estimated to be
1200 million years. Thus, the divergence between prokaryotes and eukaryotes
seems to have occurred roughly $2 \times 10^9$ years ago (Kimura and Ohta,
1973a). This conclusion is in agreement with fossil records if the microfossils
(about $2 \times 10^9$ years old) recently discovered by Hofmann (1974) are those
of eukaryotes.
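
The step from the ratio of $\delta$ values to an absolute date is a single multiplication, sketched below. The dictionary labels are illustrative; the numerical values are those quoted in the text.

```python
# Divergence time between plants and animals quoted in the text (years).
PLANT_ANIMAL_DIVERGENCE_YEARS = 1.2e9

# Ratios delta_2 / delta_1 quoted in the text for different data sets.
ratio_estimates = {
    "tRNA, McLaughlin and Dayhoff (1970)": 2.6,
    "tRNA, Kimura and Ohta (1973a)": 1.99,
    "5S RNA, Kimura and Ohta (1973a)": 1.46,
}

for source, ratio in ratio_estimates.items():
    years = ratio * PLANT_ANIMAL_DIVERGENCE_YEARS
    print(f"{source}: about {years:.2e} years")

# Range implied by the summary statement "1.5 to 2 times earlier".
low = 1.5 * PLANT_ANIMAL_DIVERGENCE_YEARS
high = 2.0 * PLANT_ANIMAL_DIVERGENCE_YEARS
print(f"prokaryote-eukaryote divergence: {low:.1e} to {high:.1e} years ago")
```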

The  divergence  of  prokaryotes  and  eukaryotes  can  be  related  to  an  even 
earlier  event  in  a very  primitive  organism,  i.e.  the  development  of  the  genetic 
code.  Comparison  of  the  nucleotide  sequences  between  tRNA’s  transporting 
different amino acids suggests that they originated from a common
proto-tRNA which acted as a nonspecific catalyst, polymerizing amino acids by a
mechanism similar to the one still used today. For example, McLaughlin
and  Dayhoff  (1970),  using  the  nucleotide  sequence  data,  showed  that  valine 
and  tyrosine  tRNA  differ  at  25.1  sites  out  of  58  on  the  average.  This  high 
degree  of  similarity  strongly  suggests  that  the  two  tRNA’s  developed  from  a 
common  origin.  The  similarities  of  the  nucleotide  sequences  of  the  same 
tRNA  between  prokaryotes  and  eukaryotes  are  slightly  higher  than  those 
between  different  tRNA.  From  these  studies,  McLaughlin  and  Dayhoff 
concluded  that  the  evolution  of  tRNA  occurred  about  1.2  times  earlier  than 
the  divergence  of  prokaryotes  and  eukaryotes. 

As  mentioned  above,  the  data  on  amino  acid  sequences  of  proteins  and 
nucleotide  sequences  of  nucleic  acids  provide  useful  information  on  organic 
evolution.  Since,  however,  the  determination  of  amino  acid  sequences  and 
nucleotide  sequences  is  not  simple,  only  a few  proteins  and  nucleic  acids 
from  a limited  number  of  species  have  been  analyzed  for  this  purpose. 
Therefore, our picture of Precambrian evolution may well change in the
future. On the other hand, data on amino acid sequences of proteins are of
little use in the study of evolution at the species or subspecies level, unless
a large number of proteins are sequenced. This is because the rate of amino
acid substitutions per site per year is so small that closely related species
often share a protein of the same amino acid sequence. For example, there
is no difference in the amino acid sequences of the α- and β-chains of
hemoglobin between man and chimpanzee. Therefore, they cannot be used for
estimating the divergence time between man and chimpanzee. In the study
of  species  or  subspecies  evolution,  however,  data  on  protein  identity  detected 
by  electrophoresis  can  be  used,  as  will  be  discussed  in  ch.  7.  The  genetic 
relatedness  between  two  different  organisms  can  also  be  studied  by  such 
techniques  as  DNA  hybridization  and  immunological  reaction  (ch.  8). 


2.3  Biochemical  unity  of  life 

There  are  about  1.5  million  different  species  of  organisms  living  on  this 
earth,  including  all  prokaryotes  and  eukaryotes.  The  basic  metabolic 
processes  of  all  these  organisms  are  very  similar.  It  is,  therefore,  considered 
that  all  organisms  have  originated  from  a common  protoorganism  which 
probably  existed  about  3.5  billion  years  ago.  Dayhoff  and  Eck  (1969)  list 
the  following  common  features  of  metabolisms: 

1)  All  cells  utilize  polyphosphates,  particularly  adenosine  phosphate,  for 
energy  transfer.  These  polyphosphates  are  manufactured  in  photosynthesis 
or in the oxidation of stored food. Their decomposition is coupled to the
organic  synthesis  of  thermodynamically  unstable  products  needed  by  the  cell. 

2)  Cells  synthesize  and  store  similar  compounds  - fats,  carbohydrates, 
and  proteins  - using  similar  reaction  pathways.  These  compounds  are 
degraded  with  release  of  energy  in  a similar  way  in  most  cells. 

3) The metabolic reactions are catalyzed largely by proteins, which are
linear  polymers  of  twenty  amino  acid  building  blocks.  A number  of  these 
proteins  have  identifiable  counterparts,  known  as  homologues,  in  most 
organisms.  The  homologous  proteins  often  have  similar  amino  acid 
sequences,  functions,  and  three-dimensional  structures. 

4)  Proteins  are  manufactured  in  the  cell  by  a complex  coding  process. 
The  machinery  of  protein  synthesis  is  the  same  for  all  organisms. 

5)  There  are  a few  ubiquitous,  small  compounds  which  take  part  in 
metabolic  processes  and  which  include  nicotinamide,  pyridoxal,  glutathione, 
the  flavinoids,  the  carotenes,  the  heme  groups,  the  isoprenoid  compounds, 
and iron sulfide. Since there are millions of possible compounds of
comparable size and energy, it seems most unlikely that these particular ones
would  have  been  chosen  independently  by  different  organisms. 

All  the  above  common  features  of  cell  metabolisms  support  the  theory  of 
common  origin  of  all  organisms  on  this  earth.  It  is  almost  impossible  that 
so  many  things  have  originated  independently  in  different  organisms  by 
chance.  I have  already  indicated  that  the  number  of  ways  in  which  the 
sequence of 1000 nucleotides of DNA can be produced is about $10^{602}$.
Therefore,  it  is  extremely  improbable  that  two  unrelated  organisms  would  by 
chance  have  selected  and  manufactured  two  structures  with  a degree  of 
similarity  as  great  as  that  observed. 

The scientific study of evolution started from Darwin and Wallace's paper
published  in  1858.  They  first  postulated  that  evolution  has  occurred  largely 
as  a result  of  natural  selection.  Natural  selection  is  effective  only  when  there 
is  genetic  variation,  and  this  genetic  variability  is  provided  primarily  by 
mutation.  At  the  time  of  Darwin,  it  was  not  known  how  genetic  variation 
arises.  Without  knowledge  of  the  laws  of  inheritance,  which  were 
discovered  by  Mendel  in  1865  but  buried  for  35  years,  Darwin  believed  in 
the  inheritance  of  acquired  characters  to  some  extent. 

The  theory  of  mutation  or  spontaneous  origin  of  new  genetic  variation 
was  first  formulated  by  de  Vries  in  1901.  He  postulated  that  occasionally 
new  genetic  variation  occurs  by  some  unknown  factor  and  this  immediately 
leads  to  a new  species.  Although  the  origin  of  new  species  by  a single 
mutation  later  proved  to  be  wrong,  the  spontaneous  origin  of  new  genetic 
variation  was  supported  by  many  subsequent  works. 

In  early  days  any  genetic  change  of  phenotypes  was  called  mutation 
without  knowing  the  cause  of  the  change.  At  present,  we  know  that  various 
factors  are  involved  in  causing  genetic  changes  of  phenotypes.  They  can 
be  studied  at  three  different  levels,  i.e.  molecular,  chromosomal,  and  genome 
levels.  In  this  chapter  we  shall  briefly  review  mutational  mechanisms  at  the 
molecular  level.  The  reader  may  refer  to  Drake's  (1970)  book  for  details. 


3.1  The  basic  process  of  gene  action 

All  the  morphological  and  physiological  characters  of  organisms  are  con- 
trolled by  the  genetic  information  carried  by  deoxyribonucleic  acid  (DNA) 
molecules,  which  are  transmitted  from  generation  to  generation.  In  some 
viruses genetic information is carried by ribonucleic acid (RNA) rather than
DNA,  but  the  essential  feature  of  inheritance  of  characters  is  the  same.  The 
genetic  information  carried  by  DNA  is  manifested  in  enzymatic  or  structural 
proteins,  which  are  macromolecules  essential  for  the  morphogenesis  and 
physiology  of  all  organisms.  In  the  process  of  development  the  genetic 
information  contained  in  the  nucleotide  sequence  of  DNA  is  first  transferred 
to  the  nucleotide  sequence  of  messenger  RNA  (mRNA)  by  a simple  process 
of  one-for-one  transcription  of  the  nucleotides  in  the  DNA.  By  the  same 
process,  transfer  RNA  (tRNA)  and  ribosomal  RNA  (rRNA)  are  produced. 
The  genetic  information  transferred  to  mRNA  now  determines  the  sequence 
of  amino  acids  of  the  protein  which  will  be  synthesized.  Nucleotides  of 
mRNA  are  read  sequentially,  three  at  a time.  Each  such  triplet  or  codon 
is  translated  into  one  particular  amino  acid  in  the  growing  protein  chain 
through  the  genetic  code  (table  3.1).  The  synthesis  of  proteins  occurs  in 
ribosomes with the aid of transfer RNA. Ribosomes are composed of rRNA
and proteins. Therefore, any of the mutations which are recognized as
morphological or physiological changes must be due to some change of
DNA molecules.

Table 3.1. The genetic code. NS: nonsense or chain-terminating codon.

3.2  Types  of  changes  in  DNA 

There are four basic types of changes in DNA. They are replacement of a
nucleotide by another (fig. 3.1b), deletion of nucleotides (fig. 3.1c), addition
of nucleotides (fig. 3.1d), and inversion of nucleotides (fig. 3.1e). Addition,
deletion, and inversion may occur with one or more nucleotides as a unit.
Addition and deletion may shift the reading frames of the nucleotide
sequence. In this case they are called frameshift mutations. Replacements of
nucleotides can be divided into two different classes, i.e. transition and
transversion (Freese, 1959). Transition is the replacement of a purine (adenine
or guanine) by another purine or of a pyrimidine (thymine or cytosine) by
another pyrimidine. Other types of nucleotide substitutions are called
transversions.
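
The distinction between transition and transversion can be expressed as a simple rule on the two bases involved. The following sketch is illustrative only; the function name is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"T", "C"}


def classify_replacement(old, new):
    """Classify a DNA base replacement as a transition or a transversion."""
    bases = PURINES | PYRIMIDINES
    if old not in bases or new not in bases:
        raise ValueError("bases must be one of A, G, T, C")
    if old == new:
        raise ValueError("no replacement: both bases are identical")
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {old, new} <= PURINES or {old, new} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Tabulate all 12 possible replacements among the four bases.
counts = {"transition": 0, "transversion": 0}
for old in "AGTC":
    for new in "AGTC":
        if old != new:
            counts[classify_replacement(old, new)] += 1
print(counts)
```

Of the 12 possible replacements, 4 are transitions and 8 are transversions, so that transversions would be twice as frequent as transitions if nucleotide replacements occurred at random.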

Fig. 3.1. An illustration of the four basic types of changes in DNA. The base sequence is
represented in units of codons or nucleotide triplets in order to show how the amino acids
coded for are changed by the nucleotide changes.

The first molecular model for the origin of spontaneous mutations was
proposed by Watson and Crick (1953). The four nucleotide bases can undergo a
tautomeric shift of a hydrogen atom with a small probability and make a
pairing  mistake.  For  example,  adenine  may  pair  with  cytosine  instead  of 
thymine.  This  type  of  mispairing  almost  always  occurs  between  a purine 
and  a 'wrong  pyrimidine'  or  a pyrimidine  and  a 'wrong  purine'.  If  these 
mispairings  occur  at  the  time  of  DNA  replication,  mutations  may  arise. 
Namely,  if  a base  of  the  template  strand  of  DNA  is  in  the  state  of  shifted 
tautomery  at  the  moment  that  the  growing  end  of  the  complementary  new 
strand  reaches  it,  a wrong  nucleotide  can  be  added  to  the  growing  end. 
Similarly,  if  the  base  of  a nucleotide  triphosphate  is  in  the  shifted  state,  it 
may  be  added  to  the  growing  end  of  a new  strand.  These  events  will  always 
give  rise  to  transition  mutations.  Freese  (1959)  extended  this  model  and 
suggested  that  transversions  may  arise  by  a similar  mechanism  when  errors 
of  pairing  occur  between  two  purines  or  two  pyrimidines.  His  data  on 
mutations  in  phage  T4  indicate  that  transversions  are  more  frequent  than 
transitions. Vogel (1972) studied the frequencies of transitions and
transversions in abnormal hemoglobins in man. He concluded that transitions
are more frequent than expected under the assumption that nucleotide
replacements occur at random, though the absolute frequency of
transversions is higher than that of transitions.

The  above  model  explains  only  replacement  mutations.  There  are  several 
other  models  which  can  explain  deletion,  addition,  and  inversion  as  well  as 
replacement,  but  none  of  them  has  been  confirmed  experimentally.  A large 
part  of  deletion,  insertion,  and  frameshift,  however,  seems  to  be  due  to 
unequal  crossing  over.  Magni  (1969)  has  shown  that  the  rate  of  frameshift 
mutations  at  meiosis  is  about  30  times  higher  than  that  at  mitosis  in  yeast, 
while  the  rate  of  missense  and  nonsense  mutations  is  almost  the  same  for 
both  meiotic  and  mitotic  divisions. 


3.3  Mutations  and  amino  acid  substitutions 


The  genes  or  segments  of  DNA  molecules  that  act  as  templates  of  mRNA’s 
are  called  structural  genes.  Since  the  amino  acid  sequence  in  a polypeptide 
is  determined  by  the  nucleotide  sequence  of  a structural  gene,  any  change  in 
amino  acid  sequences  is  caused  by  the  mutation  occurring  in  DNA.  On  the 
other  hand,  a mutational  change  of  DNA  is  not  necessarily  reflected  in 
change  of  amino  acid  sequence.  This  is  because  there  is  degeneracy  in  the 
genetic code (synonymy of codes). For example, both ATA and ATG codons
of DNA (UAU and UAC codons of mRNA, respectively) code for tyrosine,
so that the change of A to G in the third base of the ATA codon does not
produce any effect on the amino acid sequence (cf. table 3.1).
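
The tyrosine example can be verified with a short sketch. It assumes, as the example in the text does, that the DNA codon is written for the template strand and that the mRNA codon is obtained by complementing it base by base; the function name is hypothetical.

```python
# Complementary RNA base for each base of the DNA template strand.
COMPLEMENT_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

# The two mRNA codons for tyrosine in the standard genetic code.
TYROSINE_CODONS = {"UAU", "UAC"}


def transcribe(dna_codon):
    """Return the mRNA codon complementary to a DNA template codon."""
    return "".join(COMPLEMENT_TO_RNA[base] for base in dna_codon)


for dna_codon in ("ATA", "ATG"):
    mrna_codon = transcribe(dna_codon)
    is_tyr = mrna_codon in TYROSINE_CODONS
    print(f"DNA {dna_codon} -> mRNA {mrna_codon}, tyrosine: {is_tyr}")
```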

The  genetic  code  for  mRNA  is  given  in  table  3.1.  There  are  64  different 
codons  but  only  20  different  amino  acids  are  coded.  The  three  nonsense 
codons in table 3.1 are those at which the amino acid sequence of a
polypeptide is terminated. A mutation which results in one of these three
nonsense  codons  is  called  a nonsense  mutation,  while  a mutational  change  of 
one  amino  acid  codon  to  another  amino  acid  codon  is  called  a missense 
mutation. 

Let  us  now  determine  the  percentage  of  nucleotide  replacements  in  DNA 
that  can  be  detected  by  amino  acid  changes  by  using  the  genetic  code  table. 
For  this  purpose,  we  need  the  following  assumptions.  1)  The  64  different 
codons  are  equally  frequent  in  the  genome  of  an  organism.  2)  The  probability 
of  nucleotide  replacement  is  the  same  for  all  bases  of  DNA.  The  validity  of 
these  assumptions  will  be  discussed  later.  Under  the  present  assumptions  the 
relative  frequency  of  the  substitution  of  one  amino  acid  by  another  is 
proportional  to  the  possible  number  of  single-base-replacements  that  give 
rise  to  the  amino  acid  substitution.  Table  3.2  shows  the  relative  frequencies 
of  various  amino  acid  substitutions  thus  obtained,  including  nonsense 
codons.  There  are  549  (=  576  - 27)  possible  mutations  from  61  different 
amino  acid  codons.  Of  these,  415  result  in  amino  acid  substitutions  or  in 
nonsense  mutations.  Therefore,  about  76  percent  of  nucleotide  substitutions 
can  be  detected  by  examining  amino  acid  changes.  In  other  words,  about 
24  percent  of  nucleotide  substitutions  result  in  synonymous  codons,  so  that 
they  do  not  affect  the  amino  acid  sequence  of  a polypeptide  at  all.  In  the  above 
computation  all  nonsense  mutations  were  included.  There  are  23  possible 
mutations  that  result  in  nonsense  codons.  Therefore,  if  these  are  excluded, 
the  probability  that  a nucleotide  substitution  results  in  the  substitution  of 
one  amino  acid  by  another  is  0.714. 
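
The counts given above can be reproduced by enumerating every single-base replacement in the standard genetic code. The sketch below writes codons in the DNA alphabet (T in place of U) and uses the conventional 64-letter listing of the standard code; the names are illustrative, and both assumptions stated in the text (equal codon frequencies, equal replacement probabilities) are built in.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in the order TTT, TTC, TTA, TTG, TCT, ...; '*' marks a nonsense codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def count_single_base_changes():
    """Classify all single-base replacements starting from the 61 amino acid codons."""
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, aa in CODE.items():
        if aa == "*":
            continue  # replacements starting from nonsense codons are not counted
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                new_aa = CODE[codon[:pos] + base + codon[pos + 1:]]
                if new_aa == aa:
                    counts["synonymous"] += 1
                elif new_aa == "*":
                    counts["nonsense"] += 1
                else:
                    counts["missense"] += 1
    return counts


counts = count_single_base_changes()
total = sum(counts.values())
detectable = counts["missense"] + counts["nonsense"]
print(counts, "total:", total)
print(f"detectable fraction: {detectable / total:.3f}")
print(f"amino acid substitution fraction: {counts['missense'] / total:.3f}")
```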

All the computations made above depend on the two assumptions mentioned
earlier. The first assumption that the 64 different codons are equally
frequent in the genome of an organism presupposes that
the four nucleotides A, T, G, and C are equally frequent. Namely, the G-C
content  (relative  frequency  of  G and  C)  must  be  50  percent.  In  reality,  the 
G-C  content  greatly  varies  with  organism  (Sueoka,  1962).  In  vertebrates, 
however,  the  G-C  content  is  remarkably  constant  and  ranges  only  from  40 
to  44  percent.  Kimura  (1968b)  studied  the  frequencies  of  various  codons 
expected  under  random  combination  of  nucleotides,  noting  that  the  relative 
frequencies of A, T, G, and C in vertebrates are roughly 0.285, 0.285, 0.215,
and  0.215,  respectively.  The  comparison  of  the  expected  and  observed 
frequencies  of  amino  acids  in  proteins  has  shown  that  the  agreement  between 
the  two  is  quite  satisfactory  as  a crude  approximation.  He  then  computed 
the  probability  that  a mutation  is  synonymous.  It  was  0.23.  This  value  is  very 
close  to  our  previous  estimate,  0.24.  Therefore,  at  least  in  vertebrates,  the 
first  assumption  appears  to  hold  approximately. 

The  second  assumption  that  the  probability  of  nucleotide  replacement  is 
the  same  for  all  bases  also  does  not  appear  to  be  true,  strictly  speaking. 
Benzer  (1955)  has  shown  that  the  differences  in  mutation  rate  among  different 
nucleotide sites in the rII gene of phage T4 are enormous, although most of
the mutations he studied are conditional lethals and exclude neutral or
advantageous mutations. Data on the amino acid substitutions in the
evolutionary process also indicate that the probability of nucleotide
replacement is not the same for all DNA bases (ch. 8). Nevertheless, our result
about  the  probability  of  synonymous  mutation  seems  to  be  roughly  correct 
if  we  exclude  those  codons  at  which  nucleotide  replacement  rarely  occurs. 

Amino  acid  sequencing  requires  a large  quantity  of  purified  protein, 
which  is  not  always  easy  to  obtain.  A quick  method  of  detecting  amino 
acid  substitution  in  a protein  is  to  examine  the  electrophoretic  mobility  of 
protein  in  a gel.  This  method  is  now  being  used  extensively  in  detecting 
protein  variations  in  natural  populations.  The  electrophoretic  mobility  of  a 
protein  is  largely  determined  by  the  net  charge  of  the  protein.  Let  us  now 
determine the probability that an amino acid substitution results in a net
charge change of a protein. At the ordinary pH value at which
electrophoresis is conducted, lysine and arginine are positively charged, while
aspartic acid and glutamic acid are negatively charged. Other amino acids
are all neutral. From table 3.2, we can compute the expected relative
frequencies of various types of charge changes of a protein. The results obtained
are given in table 3.3, together with the empirical frequencies which have
occurred in such proteins as hemoglobin, cytochrome c, myoglobin, virus
coat protein, etc., in the actual evolutionary process. It is seen that the total
probability of charge change of protein is roughly 0.25 to 0.3. In the study
of evolution or protein polymorphism the empirical value would be more
meaningful than the theoretical. In this book we shall use 0.25 as the
detectability of protein differences. It must be kept in mind, however, that
electrophoretic mobility of a protein is also affected by its tertiary structure,
the location of charged amino acids in protein sequences, etc. Therefore, the
above estimate may well be corrected in the future.

Table 3.3

Relative frequencies of amino acid substitutions resulting in a charge change of a protein.
From Nei and Chakraborty (1973).

n, +, and - refer to 'neutral', 'positive', and 'negative', respectively.

Obtained from the genetic code table; the total number of base changes which give rise
to amino acid substitutions is 392.

† Obtained from the empirical data on amino acid substitutions (Dayhoff, 1969); the
total number of amino acid substitutions used is 790.
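
The theoretical part of this computation can be sketched by counting, among all single-base replacements that change one amino acid into another, those that alter the charge class. The sketch assumes the standard genetic code, equally frequent codons, and the charge assignments stated in the text (lysine and arginine positive, aspartic acid and glutamic acid negative, all others neutral); the names are illustrative, and the printed fraction is not claimed to reproduce the entries of table 3.3.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in the order TTT, TTC, TTA, TTG, TCT, ...; '*' marks a nonsense codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

# Charge classes at the pH of electrophoresis, as stated in the text.
CHARGE = {"K": +1, "R": +1, "D": -1, "E": -1}


def charge_change_counts():
    """Count amino acid replacements and those that change the charge class."""
    missense = changed = 0
    for codon, aa in CODE.items():
        if aa == "*":
            continue
        for pos, base in product(range(3), BASES):
            if base == codon[pos]:
                continue
            new_aa = CODE[codon[:pos] + base + codon[pos + 1:]]
            if new_aa == "*" or new_aa == aa:
                continue  # skip nonsense and synonymous changes
            missense += 1
            if CHARGE.get(aa, 0) != CHARGE.get(new_aa, 0):
                changed += 1
    return changed, missense


changed, missense = charge_change_counts()
print(f"{changed} of {missense} amino acid replacements change the charge "
      f"({changed / missense:.3f})")
```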

Recently,  Bernstein  et  al.  (1973)  reported  that  the  detectability  of  protein 
differences may be increased by heat treatment of proteins before
electrophoresis. In the case of xanthine dehydrogenase in Drosophila the detectability
was  doubled  by  this  method. 


3.4  Effects  on  fitness 

The  population  dynamics  of  a mutant  gene  is  largely  determined  by  its 
effect  on  the  fitness  of  an  individual.  Therefore,  it  is  important  to  know  the 
effect  on  fitness  of  each  mutant  gene  or  the  frequency  distribution  of  fitnesses 
of  new  mutations.  This  is  a very  difficult  task,  however,  since  the  fitness  of  an 
individual  clearly  depends  on  the  environment  in  which  the  individual  is 
placed  and,  even  in  a given  environment,  fitness  is  composed  of  many 
components,  such  as  viability,  mating  ability,  fertility,  etc.  Furthermore,  to 
detect  a small  effect  on  fitness,  an  enormous  number  of  individuals  must  be 
tested.  The  present  estimates  of  the  distributions  of  fitnesses  are  largely  based 
on conjectures and personal preferences. Thus, in a symposium on 'Darwinian,
Neo-Darwinian, and Non-Darwinian Evolution', Crow (1972), King
(1972), and Bodmer and Cavalli-Sforza (1972) produced several different
hypothetical distributions. One common feature of these distributions is that
neutral or nearly neutral mutations have the highest frequency. From a statistical
study of hemoglobin mutations, however, Kimura and Ohta (1973b)
concluded that deleterious mutations are about ten times more frequent
than neutral or nearly neutral mutations, neglecting synonymous mutations
at the codon level.

Strictly  speaking,  the  fitness  effect  of  a mutation  should  be  determined  by 
a careful  population  genetics  experiment,  but  some  aspects  of  mutational 
effects  can  be  inferred  by  looking  at  the  molecular  structure  of  genes  or 
proteins produced. As discussed by Freese (1962), Kimura (1968b), and
King and Jukes (1969), certain classes of mutations seem to be selectively
neutral at the molecular level. The first candidates for such mutations are
synonymous  mutations.  Although  there  is  some  argument  against  neutrality 
of  synonymous  mutations  (Richmond,  1970),  the  prevalence  of  such 
mutations  in  the  evolutionary  process  suggests  that  they  are  virtually  neutral. 
We  have  shown  that  the  expected  frequency  of  synonymous  mutations  is  as 
high  as  24  percent  of  the  total  nucleotide  replacements.  Of  course,  this  class 
of  mutations  is  expected  to  have  little  effect  on  any  phenotypic  character, 
though  they  may  affect  the  subsequent  course  of  evolution.  The  second  class 
of neutral mutations consists of mutations in nonfunctional genes. Higher organisms
seem to carry a large number of nonfunctional genes, as will be discussed
later. An obvious example of this class of DNA is that of constitutive
heterochromatin, a large part of which is apparently nonfunctional. Mutations
occurring in this type of DNA would be essentially neutral, and
they again would have little effect on phenotypic characters.

Table 3.4

Human hemoglobin variants which correspond to mutations that have become incorporated
into the normal hemoglobins of other species. From King and Jukes (1969).

A certain proportion of the mutations that result in amino acid replacements
in proteins could also be selectively neutral. We have seen that the
amino acid sequences of hemoglobin and cytochrome c vary considerably
with organism. Namely, different mutations have been fixed in different
organisms.  Yet,  it  has  been  shown  that  the  cytochromes  c from  various 
organisms  are  fully  interchangeable  in  in  vitro  tests  of  reaction  with  substrates 
(Dickerson,  1971).  Although  this  is  not  necessarily  the  proof  of  neutral  or 
nearly  neutral  gene  substitutions,  it  indicates  that  there  are  many  different 
forms  of  alleles  that  are  virtually  identical  in  function.  The  replacement  of  an 
amino  acid  by  another  with  similar  properties  at  nonactive  sites  seems  to 
result  in  no  disturbance  of  protein  function  (Smith,  1968,  1970).  In  most 
proteins  there  are  many  such  possible  amino  acid  replacements  (King  and 
Jukes,  1969).  In  recent  years  a large  number  of  hemoglobin  variants  have 
been  discovered  in  man.  Amino  acid  replacements  found  in  some  of  these 
variants  apparently  do  not  disturb  the  hemoglobin  function,  since  the  same 
mutations  have  been  fixed  in  other  organisms  (table  3.4).  (See,  however, 
the  concept  of  covarions  in  ch.  8.) 


3.5  Rate  of  spontaneous  mutation 

Before  the  development  of  molecular  genetics,  geneticists  had  established 
that the rate of spontaneous mutation per locus is of the order of $10^{-5}$ per
generation in many higher organisms such as the fruit fly, corn, and man. These
estimates  were  obtained  from  studies  of  the  changes  of  morphological  or 
physiological  characters,  including  lethal  mutations.  The  mutations  identified 
in  this  way  possibly  included  some  small  chromosomal  aberrations,  while 
the  mutations  which  do  not  change  the  phenotype  drastically  were  not 
included.  Mutations  can  now  be  studied  at  the  molecular  level,  but  still 
very  little  is  known  about  the  rate  of  nucleotide  changes  per  locus. 

The  mutation  rates  so  far  estimated  in  microorganisms  are  based  on 
essentially  the  same  principle  as  that  in  higher  organisms.  That  is,  mutations 
are  identified  by  inability  to  produce  some  biochemical  substances  that  are 
present  in  the  wild-type  strain.  For  technical  reasons,  back  mutations  are 
often  used  to  determine  the  rate  of  mutation.  The  mutation  rates  determined 
with  microorganisms  are  considered  to  be  more  accurate  than  those  in  higher 
organisms,  because  biochemically  less  complicated  characters  are  used  and  a 
large number of offspring can be tested. Table 3.5 shows some of the estimates
of mutation rates in the bacterium Escherichia coli. It is clear that the
mutation rate varies greatly with locus. Part of the variation in mutation
rate among loci may be due to the difference in the number of nucleotide
pairs within a gene. Watson (1965) has estimated that the replication error
at the nucleotide level is about $10^{-9}$. If a gene consists of 1000 nucleotide
pairs, this corresponds to a mutation rate of $10^{-6}$ per gene per replication.
This estimate is, however, very crude, and the exact rate of mutation per
nucleotide replication remains to be determined.

Table 3.5

Rates of spontaneous mutation in Escherichia coli. From Ryan (1963).

Phenotypic and genotypic change        Mutation rate per cell

To convert these to a rate per gene would require dividing by the number of times a gene
is present per cell, a number of the order of 4.

In  recent  years  a large  number  of  abnormal  hemoglobins  have  been 
discovered.  The  list  of  abnormal  hemoglobins  made  by  Hunt  et  al.  (1972) 
includes 47 different kinds of single amino acid substitutions in the α-chain
and 80 different kinds in the β-chain. Almost all of these were detected by
electrophoresis. Theoretically, there are about 900 different kinds of mutants
that result from a single nucleotide replacement in each of the α- and β-chains. If
only 1/4 of amino acid replacements are detectable by electrophoresis, about
1/5 of the detectable α-chain and 1/3 of the detectable β-chain variants have
been  discovered. 
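
The proportions quoted above follow from simple arithmetic, which may be sketched as below. The names used are illustrative only; the figures (about 900 possible mutants per chain, a detectability of 1/4, and 47 and 80 observed variants) are those given in the text.

```python
# Fraction of the electrophoretically detectable hemoglobin variants
# that had already been found, per chain.
POSSIBLE_PER_CHAIN = 900   # single-nucleotide mutants per chain (from the text)
DETECTABILITY = 1 / 4      # fraction detectable by electrophoresis (section 3.3)


def fraction_discovered(observed: int) -> float:
    """Observed variants divided by the expected number of detectable kinds."""
    detectable = POSSIBLE_PER_CHAIN * DETECTABILITY  # 225 detectable kinds
    return observed / detectable


alpha = fraction_discovered(47)  # alpha-chain, roughly 1/5
beta = fraction_discovered(80)   # beta-chain, roughly 1/3
print(f"alpha-chain: {alpha:.2f}, beta-chain: {beta:.2f}")
```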

Kimura and Ohta (1973b) estimated the mutation rate from the frequency
of these hemoglobin variants. The data used are those of Yanase et al. (1968)
and Iuchi (1968). These authors discovered altogether 44 electrophoretically
different variants of the α- and β-chains represented in 62 individuals in
surveys of about 320,000 individuals. Since these variants are all represented
in heterozygous condition and only one third of the variants are detected
by electrophoresis, the gene frequency of abnormal hemoglobins is estimated
to be about $3 \times 10^{-4}$. Hanada (see Kimura and Ohta, 1973b) examined the
hemoglobins of the parents of 18 variant individuals and found that two
of the 18 cases are new mutations. Thus, the fraction of new mutations is 1/9.
The mutation rate for the hemoglobin α- and β-chains is then estimated to
be $3.3 \times 10^{-5}$. Since the α- and β-chains consist of 141 and 146 amino acids,
respectively, the mutation rate per codon becomes $10^{-7}$ per generation.
Furthermore, if we note that the probability of a nucleotide replacement
resulting in amino acid replacement is about 3/4 and there are three nucleotides
in a codon, the mutation rate per nucleotide per generation is estimated
to be $4.4 \times 10^{-8}$. Human germ cells divide about 50 times before gametes
are produced. Thus, the mutation rate per cell division is close to Watson's
estimate.
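
The chain of estimates in this paragraph can be retraced step by step. The following is only a sketch with illustrative variable names; the input figures are those quoted in the text. Note that the text rounds the intermediate results (to $3 \times 10^{-4}$ and $10^{-7}$) before continuing, so the unrounded values computed here differ slightly in the later steps.

```python
# Retracing the hemoglobin mutation-rate estimate with unrounded arithmetic.
variant_individuals = 62        # carriers found in the surveys
surveyed = 320_000              # individuals examined
detected_fraction = 1 / 3       # share of variants visible by electrophoresis
new_mutation_fraction = 2 / 18  # new mutations among the 18 cases examined
codons = 141 + 146              # alpha-chain plus beta-chain
p_amino_acid_change = 3 / 4     # nucleotide changes that alter the amino acid
nucleotides_per_codon = 3
divisions_per_generation = 50   # germ-cell divisions before gametes

# Carriers are heterozygotes: one variant gene among the two they carry.
gene_freq = variant_individuals / (2 * surveyed) / detected_fraction
rate_both_chains = gene_freq * new_mutation_fraction
rate_per_codon = rate_both_chains / codons
rate_per_nucleotide = rate_per_codon / p_amino_acid_change / nucleotides_per_codon
rate_per_division = rate_per_nucleotide / divisions_per_generation

for label, value in [
    ("gene frequency of variants", gene_freq),
    ("mutation rate, both chains", rate_both_chains),
    ("mutation rate per codon", rate_per_codon),
    ("mutation rate per nucleotide", rate_per_nucleotide),
    ("mutation rate per cell division", rate_per_division),
]:
    print(f"{label}: {value:.2e}")
```

The last figure comes out near $10^{-9}$, which is the sense in which the estimate is close to Watson's.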

It  should  be  noted,  however,  that  Kimura's  estimate  is  based  on  only  two 
confirmed  new  mutations.  Therefore,  his  estimate  may  well  change  in  the 
future. Recently, Neel (1973) estimated that the rate of electrophoretically detectable
mutations in enzymatic genes is $10^{-4}$ per locus per generation from the
balance between mutation and loss of alleles in the Yanomama and Makirite
populations of American Indians. It is the same order of magnitude as
Kimura's estimate. However, Neel's estimate may be a gross overestimate
if there is migration between the Yanomama-Makirite and their neighboring
populations.  It  is  also  known  that  the  estimate  obtained  by  his  method  is 
subject  to  a large  standard  error  even  if  there  is  no  migration. 

If a mutation results in malfunctioning of a protein or RNA, the mutation
will be eliminated from the population rather quickly. A majority of mutations
seem to be of this type. On the other hand, if a mutation does not
affect the function of the protein or RNA produced or improves it, the
mutant cistron may increase in frequency in the population and finally
replace the original type. Dayhoff et al. (1972a) called such mutations
accepted point mutations. If substitution of genes in populations occurs
mostly by random genetic drift, it can be shown that the rate of gene substitution
per unit length of time is equal to the mutation rate (ch. 5). Therefore,
if we assume that the majority of accepted point mutations are selectively
neutral, the mutation rate can be estimated from the rate of gene substitution
or amino acid substitution in proteins.

Table 3.6

Rates of amino acid substitutions (accepted point mutations) per residue per $10^9$ years in
certain proteins. From McLaughlin and Dayhoff (1972).

The  rate  of  amino  acid  substitutions  in  evolution  has  been  studied  for 
a number  of  proteins.  Table  3.6  shows  the  rates  of  amino  acid  substitutions 
per  residue  for  the  proteins  so  far  studied.  As  will  be  seen  in  ch.  8,  the  rate 
of  amino  acid  substitution  is  roughly  constant  per  year  rather  than  per 
generation. Therefore, the rates in table 3.6 are given in terms of chronological
time. It is seen that the rate varies considerably with protein or
polypeptide,  the  highest  rate  (fibrinopeptides)  being  more  than  1000  times 
higher  than  the  lowest  rate  (histone  IV).  This  variation  is  believed  to  reflect 
the  constraints  in  amino  acid  sequence  of  proteins  (ch.  8).  Histone  IV 
seems  to  require  a very  rigid  amino  acid  sequence  to  be  functional  and  many 
amino  acid  substitutions  presumably  result  in  deleterious  effects.  We  have 
estimated the average mutation rate for a human hemoglobin codon to be
$10^{-7}$ per generation. If the average generation time in the past is 20 years,
this corresponds to $5 \times 10^{-9}$ per codon per year. This is the same order of
magnitude  as  the  rate  of  amino  acid  substitution  for  fibrinopeptides.  This 
suggests  that  the  majority  of  the  mutations  occurring  in  the  fibrinopeptide 
cistron  are  selectively  neutral.  This  problem  will  be  discussed  further  (ch.  8). 

The  reader  may  wonder  why  the  rate  of  acceptable  point  mutations  should 
be  constant  per  year,  while  classical  genetics  has  established  a constancy  of 
mutation  rate  per  generation.  The  explanation  seems  to  be  that  the  type 
of  mutations  studied  in  classical  genetics  is  different  from  the  evolutionarily 
acceptable  point  mutations.  In  classical  genetics,  the  rate  of  mutations  was 
measured  mostly  by  using  deleterious  mutations.  It  is  possible  that  a majority 
of  these  mutations  are  due  to  deletion,  insertion,  or  frameshift  at  the 
molecular  level  or  larger  chromosomal  aberration  (mostly  deletions).  In  fact, 
Magni  (1969)  showed  in  yeast  that  a majority  of  mutations  are  frameshifts 
occurring  at  meiosis.  Muller  (1959)  also  showed  that  a majority  of  lethal 
mutations  in  Drosophila  occur  at  the  meiotic  stage.  Then,  we  would  expect 
that  the  rate  of  deleterious  mutations  is  constant  per  generation  rather  than 
per year. On the other hand, the evolutionarily acceptable mutations appear
to be a small fraction of the total mutations and occur at almost any time.
In classical genetics these mutations were almost never measured.

Table 3.7

Relation of mutation rate to rate of cell division. From Novick and Szilard (1950).

Generation time,    Rate of mutation         Rate of mutation
hours               per generation           per hour

 2                  $2.5 \times 10^{-8}$     $1.25 \times 10^{-8}$
 6                  $7.5 \times 10^{-8}$     $1.25 \times 10^{-8}$
12                  $15.0 \times 10^{-8}$    $1.25 \times 10^{-8}$

In  microorganisms  there  is  evidence  that  the  rate  of  nondeleterious 
mutations  depends  largely  on  chronological  time  rather  than  generation 
time.  In  a chemostat  experiment  of  Escherichia  coli,  Novick  and  Szilard 
(1950)  showed  that  the  rate  of  mutations  from  the  wild-type  to  the  phage- 
resistant  type  is  proportional  to  chronological  time  (table  3.7).  Nevertheless, 
there  is  some  evidence  that  replication  of  genes  is  required  for  mutation 
(Ryan,  1963). 
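
The proportionality to chronological time can be checked directly from table 3.7 by dividing each rate per generation by the generation time. The sketch below does this; the entries are taken here as multiples of $10^{-8}$ (an assumption about the scale of the table), and the names are illustrative.

```python
# (generation time in hours, mutation rate per generation), as read from table 3.7.
# The common factor of 1e-8 is assumed; only the ratios matter for the argument.
observations = [(2, 2.5e-8), (6, 7.5e-8), (12, 15.0e-8)]

# If mutation depends on elapsed time, the rate per hour should not change
# when the generation time changes.
rates_per_hour = [per_generation / hours for hours, per_generation in observations]

for (hours, per_generation), per_hour in zip(observations, rates_per_hour):
    print(f"generation time {hours:>2} h: "
          f"{per_generation:.2e} per generation, {per_hour:.2e} per hour")
```

A sixfold change in generation time leaves the hourly rate unchanged, whereas the rate per generation changes sixfold.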


Table 3.8

Numbers of subunits and subunit molecular weights of proteins and enzymes. Modified
from Darnall and Klotz (1972).

Protein                  No. of     Subunit       Protein                    No. of     Subunit
                         subunits   MW                                       subunits   MW

Acid phosphatase         2          42,000        Lactate dehydrogenase      4          35,000
Alcohol dehydrogenase    2          41,000        Leucine amino-peptidase    4          63,500
Alkaline phosphatase     2          40,000        Peptidase-A                2          46,000
Catalase                 4          57,000        Peptidase-B                1          34,000
Ceruloplasmin            8          18,000        Peptidase-C                1          64,000
G6PD                     4          50,000        Peptidase-D                2          50,000
Glutathione reductase    2          56,000        Phosphoglucose isomerase   2          61,000
Group specific           2          2^m           Pyruvate kinase            4          57,200
Haptoglobin              4          α = 9,100     6PGD                       2          441,000
                                    β = 36,000
Hemoglobin                          16,000        Transferrin                           77,000


It is often necessary to know the mutation rate per locus or per cistron,
since the unit of gene function is generally the 'cistron', which corresponds to a
'polypeptide'. If we assume neutral mutations, this value can be obtained by
multiplying the rate of amino acid substitution in table 3.6 by the total
number of codons per polypeptide. We shall use this method to estimate the
average mutation rate for enzymes and proteins which are often used in
population genetics. These enzymes and proteins are generally larger than
the proteins given in table 3.6 and the amino acid sequences are not known.
A list of the molecular weights of the subunit polypeptides for proteins and
enzymes commonly used in population genetics is given in table 3.8. The
average molecular weight of the polypeptides is 44,657. Since the average
molecular weight of an amino acid is 110 (Smith, 1966), the number of
codons per cistron is estimated to be about 400. On the other hand, the
mean and the median of the rate of amino acid substitution per codon per year
for the proteins in table 3.6 are $1.8 \times 10^{-9}$ and $1 \times 10^{-9}$, respectively. We shall
use the median, since the number of proteins examined is still small.
Therefore, the rate of amino acid substitutions per polypeptide is estimated
to be $4 \times 10^{-7}$ per year, which is equal to the neutral mutation rate under
the assumption we made. Note that this does not include deleterious mutations,
which would never be fixed in the population.

In  population  genetics  mutant  alleles  are  often  detected  by  electrophoresis 
of  the  protein  produced.  As  mentioned  earlier,  however,  electrophoresis  can 
detect  only  about  a quarter  of  the  total  mutations.  Therefore,  the  rate  of 
electrophoretically detectable mutations is estimated to be $10^{-7}$ per locus per year
on  the  average.  Kimura  and  Ohta  (1971a)  have  reached  the  same  estimate 
in  a slightly  different  way. 
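
The estimate just described is a short chain of multiplications, sketched below. The variable names are illustrative; the median substitution rate is taken as $1 \times 10^{-9}$ per codon per year, the value consistent with the figure of $4 \times 10^{-7}$ per polypeptide quoted above.

```python
# Average neutral mutation rate per locus, following the steps in the text.
avg_subunit_mw = 44_657        # mean subunit molecular weight (table 3.8)
amino_acid_mw = 110            # average molecular weight of an amino acid
median_rate_per_codon = 1e-9   # substitutions per codon per year (table 3.6)
detectability = 1 / 4          # fraction detectable by electrophoresis

# Number of codons per cistron; about 406, which the text rounds to 400.
codons_exact = avg_subunit_mw / amino_acid_mw
codons_rounded = 400

# Under neutrality the substitution rate equals the mutation rate.
neutral_rate = codons_rounded * median_rate_per_codon   # per polypeptide per year
detectable_rate = neutral_rate * detectability          # per locus per year

print(f"codons per cistron: {codons_exact:.0f} (rounded to {codons_rounded})")
print(f"neutral mutation rate: {neutral_rate:.1e} per polypeptide per year")
print(f"electrophoretically detectable rate: {detectable_rate:.1e} per locus per year")
```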

Recently, Tobari and Kojima (1972) studied the mutation rate for ten
enzyme loci (α-glycerophosphate dehydrogenase, malate dehydrogenase-1,
alcohol dehydrogenase, isocitrate dehydrogenase, esterase-6, adult alkaline
phosphatase, esterase-C, octanol dehydrogenase, xanthine dehydrogenase,
aldehyde oxidase) in Drosophila melanogaster. They found three electrophoretically
detectable mutations, but two of them did not follow simple
Mendelian inheritance. Their estimate of mutation rate, based on the three
mutants, was $4.5 \times 10^{-6}$ per locus per generation. This is not unreasonable
if it includes deleterious mutations. Mukai (personal communication) is
also conducting an experiment to estimate the mutation rate for enzyme loci
in D. melanogaster. So far he has observed a single mutant and estimates
that the rate of electrophoretically detectable mutations is about $10^{-6}$ per
locus per generation.

Clearly, more studies should be made to determine the mutation rate for
enzyme loci. Without reliable estimates of mutation rate, it is difficult to
understand the mechanism of maintenance of genetic variability as well as of
evolutionary change of populations.


4.1 Natural selection and mathematical models

In population genetics natural selection means the differential rates of reproduction
among different genotypes. Thus, when viability and fertility
are  the  same  for  all  genotypes,  there  is  no  natural  selection.  Natural  selection 
is  an  important  factor  that  causes  adaptive  change  of  populations.  It  is  well 
known  that  most  organisms  are  adapted  amazingly  well  to  the  environment 
in  which  they  live.  It  is,  therefore,  very  important  to  know  how  natural 
selection  operates  in  nature.  On  the  other  hand,  populations  or  organisms 
sometimes  change  nonadaptively  primarily  because  of  stochastic  elements 
in  gene  frequency  changes.  In  the  present  section,  we  shall  study  the  modes 
and  effects  of  natural  selection,  using  deterministic  models.  Stochastic 
changes  of  gene  frequencies  will  be  discussed  in  the  next  chapter. 

Natural  selection  is  an  extremely  complicated  biological  process.  The  mode 
of  selection  depends  on  many  physical  and  biological  factors.  The  selective 
advantage of a genotype over another may depend on temperature, population
density, availability of resources, predation by other species, and many
other factors, which need not remain constant over time in nature.
One example will suffice to show how the real process of selection
is affected by environmental or ecological factors. Fig. 4.1 shows the
adult survivorships of the three genotypes +/+, +/b, and b/b of the flour
beetle (Tribolium castaneum) in pure and mixed cultures, where b stands for
the black gene. There are four different levels of population density. In pure
culture the survivorship is not much affected by density. In particular, the
wild-type genotype +/+ has a survivorship of about 73 percent at all densities.
In mixed culture, however, the survivorship is affected not only by density but also
by genotype frequency. For example, the survivorship of +/+ is low when
the frequency of this genotype is high but becomes higher when the frequency
decreases. Here, 'minority advantage' is clearly observed.

Fig. 4.1. Adult survivorships expressed as percentages of egg input for four densities and
three gene frequencies in Tribolium castaneum. Leftmost column represents the results of
rearing the beetles in pure culture. White bars represent the +/+ genotype, gray bars +/b,
and black bars b/b. The densities 5/g, etc., denote five beetles per gram of medium, etc.
From Sokal and Karten (1964).

Another  factor  which  complicates  the  mode  of  natural  selection  is  the 
presence of a large number of loci segregating in a population and the interaction
of these loci in the process of natural selection. Natural selection
operates among individuals rather than among genes, as stressed by Wright
(1931). Therefore, if a large number of interacting loci are involved, the
description of the process of natural selection becomes enormously complicated.
In order to develop a scientific theory of natural selection, however,
we  must  abstract  from  nature  some  important  factors  and  then  make  a 
model  of  selection.  The  model  is  always  unrealistic  in  some  respects.  If  the 
model  is  as  complex  as  the  real  situation  in  a specific  case,  it  is  no  longer  a 
model.  It  lacks  the  generality  that  is  required  for  a model.  Nevertheless,  a 
model  must  be  able  adequately  to  describe  the  process  under  study.  Our 
ultimate  aim  is  to  understand  the  biological  principles  that  underlie  the 
processes  of  genetic  change  of  populations.  If  the  model  does  not  give  any 
insight  into  the  actual  genetic  processes,  it  is  useless. 

In  the  present  chapter  we  shall  first  discuss  the  growth  and  regulation  of 
populations  and  then  some  basic  mathematical  models  of  natural  selection. 


4.2  Growth  and  regulation  of  populations 

4.2.1 Continuous time model


1)  Exponential  growth 

When abundant resources and space are available, a population of organisms
increases exponentially. Let $N_t$ be the number of individuals at time $t$, and
assume that in an infinitesimal time interval $\Delta t$ a fraction $a\Delta t$ of the population
produce an offspring and a fraction $b\Delta t$ die. The change in population
size during this interval is

$$\Delta N_t = (a - b) N_t \Delta t. \tag{4.1}$$

Putting $\Delta t \to 0$, we have

$$\frac{dN_t}{dt} = m N_t, \tag{4.2}$$

where $m$ is $a - b$. Solution of the above equation gives

$$N_t = N_0 e^{mt}. \tag{4.3}$$

In population genetics $m$ is called the Malthusian parameter.
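
The passage from the small-interval form (4.1) to the exponential solution (4.3) can be illustrated numerically. In the sketch below, the change $(a - b) N_t \Delta t$ is applied over many short intervals and the result is compared with $N_0 e^{mt}$. The birth and death rates are arbitrary illustrative values, not taken from the text.

```python
import math


def exponential_size(n0: float, m: float, t: float) -> float:
    """Closed-form solution (4.3): N_t = N_0 * exp(m * t)."""
    return n0 * math.exp(m * t)


def stepwise_size(n0: float, a: float, b: float, t: float, steps: int) -> float:
    """Apply the change (4.1) repeatedly over intervals of length t / steps."""
    dt = t / steps
    n = n0
    for _ in range(steps):
        n += (a - b) * n * dt
    return n


a, b = 0.3, 0.1  # illustrative birth and death rates per unit time
exact = exponential_size(100.0, a - b, 10.0)
approx = stepwise_size(100.0, a, b, 10.0, 100_000)
print(f"closed form: {exact:.3f}, stepwise: {approx:.3f}")
```

As the interval is made shorter, the stepwise value approaches the closed form, which is what putting $\Delta t \to 0$ expresses.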


2)  Logistic  growth 

In reality, resources and space are always limited, so that a population cannot
grow exponentially forever. In this case the differential equation (4.2) may
be changed in the following way:

$$\frac{dN_t}{dt} = m N_t (1 - f(N_t)), \tag{4.4}$$

where $f(N_t)$ is a function of $N_t$. A simple form of $f(N_t)$ is $N_t/K$, where $K$
is a positive constant. In this case population size increases if $N_t < K$,
whereas it decreases if $N_t > K$. Therefore, population size eventually becomes
equal to $K$. $K$ is often called the carrying capacity of the environment, while
$N_t/K$ is called the Verhulst-Pearl factor. Equation (4.4) can then be integrated
and we have

$$N_t = \frac{K}{1 + c_0 e^{-mt}}, \tag{4.5}$$

where $c_0 = (K - N_0)/N_0$. The above equation is called the logistic equation.
There are many data which support the approximate validity of the logistic
equation (Lotka, 1956). However, the biological interpretation of
$f(N_t) = N_t/K$ varies considerably in individual cases.
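
As a check on the logistic equation, the sketch below evaluates $N_t = K/(1 + c_0 e^{-mt})$ with $c_0 = (K - N_0)/N_0$ for a population starting below $K$ and for one starting above it, and compares the numerical slope of the curve with the growth rate $m N_t (1 - N_t/K)$. The parameter values are illustrative assumptions only.

```python
import math


def logistic_size(n0: float, m: float, k: float, t: float) -> float:
    """Logistic equation: N_t = K / (1 + c0 * exp(-m t)), c0 = (K - N0) / N0."""
    c0 = (k - n0) / n0
    return k / (1.0 + c0 * math.exp(-m * t))


def growth_rate(n: float, m: float, k: float) -> float:
    """Rate of change m * N * (1 - N / K), i.e. f(N) = N / K."""
    return m * n * (1.0 - n / k)


m, K = 0.5, 1000.0  # illustrative Malthusian parameter and carrying capacity

# A population starting below K rises to K; one starting above K falls to it.
for n0 in (10.0, 1500.0):
    sizes = [logistic_size(n0, m, K, t) for t in (0, 5, 10, 20, 60)]
    print(f"N0 = {n0:>6}: " + ", ".join(f"{n:.1f}" for n in sizes))

# The slope of the curve agrees with the differential equation.
h = 1e-5
slope = (logistic_size(10.0, m, K, 5.0 + h) - logistic_size(10.0, m, K, 5.0 - h)) / (2 * h)
expected_slope = growth_rate(logistic_size(10.0, m, K, 5.0), m, K)
print(f"numerical slope: {slope:.4f}, m N (1 - N/K): {expected_slope:.4f}")
```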


4.2.2  Discrete  generation  model 


1)  Geometric  growth 

In  the  study  of  natural  selection,  it  is  often  convenient  to  use  discrete 
generation  models  rather  than  continuous  time  models.  The  former  give  a 
deeper  insight  into  the  process  of  natural  selection  than  the  latter.  Let  Nt 
be  the  number  of  adult  individuals  at  generation  t.  We  designate  by  k and  v 
the  fertility  and  viability  of  an  individual.  The  reproductive  value  is  then 
given  by  W = kv.  The  formulae  equivalent  to  (4.2)  and  (4.3)  in  the  continuous 
time  model  are  given  by 

$$\Delta N_t = N_{t+1} - N_t = (W - 1) N_t, \tag{4.6}$$

and

$$N_t = W^t N_0, \tag{4.7}$$

respectively.  In  population  genetics  W is  called  the  Wrightian  fitness  in 
contrast  to  the  Malthusian  parameter. 

2)  Logistic  growth 

We can incorporate into (4.6) a population-regulating factor $f(N_t) = N_t/K$
as in (4.4). It becomes

$$\Delta N_t = \frac{W - 1}{K} (K - N_t) N_t. \tag{4.8}$$

Mathematically, $N_t$ does not necessarily converge to its equilibrium value,
$K$ (Maynard Smith, 1968a). In fact, if $W > 3$, the population size may diverge
with oscillation; if $1 < W < 2$, it approaches $K$ without oscillation; and if
$2 < W < 3$, it converges to $K$ with oscillation. Therefore, only when
$1 < W < 2$ does the population size increase logistically. However, this interpretation does
not have much biological meaning. In practice, $N_t$ would rarely become
larger than $K$, since $K$ is the carrying capacity by definition. For example, if
the number of adult individuals is limited by the number of territories, $N_t$
will never be larger than $K$, even if the number of young exceeds $K$. The same
situation would occur if $N_t$ is determined by the amount of resource available.
Thus, the applicability of (4.8) should be restricted to the range of $N_t < K$.
If $N_t$ reaches $K$, then $N_t$ should remain constant. Namely, even if $W > 2$, no
oscillation will occur in practice.
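
The three ranges of $W$ can be seen by iterating (4.8) directly. The following sketch uses illustrative values of $W$, $K$, and $N_0$; the option of holding $N_t$ at $K$ once it is reached is one simple way of expressing the restriction discussed above, not a procedure given in the text.

```python
def next_size(n: float, w: float, k: float) -> float:
    """One generation of (4.8): N + (W - 1) / K * (K - N) * N."""
    return n + (w - 1.0) / k * (k - n) * n


def trajectory(n0: float, w: float, k: float, generations: int, cap: bool = False):
    """Iterate (4.8); with cap=True, N is never allowed to exceed K."""
    sizes = [n0]
    for _ in range(generations):
        n = next_size(sizes[-1], w, k)
        sizes.append(min(n, k) if cap else n)
    return sizes


K = 1000.0
smooth = trajectory(10.0, 1.5, K, 200)     # 1 < W < 2: approaches K from below
damped = trajectory(10.0, 2.5, K, 200)     # 2 < W < 3: overshoots, then settles
unsettled = trajectory(10.0, 3.5, K, 200)  # W > 3: keeps oscillating around K
capped = trajectory(10.0, 2.5, K, 200, cap=True)  # N held at K once reached

for name, sizes in [("W = 1.5", smooth), ("W = 2.5", damped),
                    ("W = 3.5", unsettled), ("W = 2.5, capped", capped)]:
    tail = ", ".join(f"{n:.1f}" for n in sizes[-4:])
    print(f"{name}: max = {max(sizes):.1f}, last four = {tail}")
```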

In population or evolutionary genetics long-term changes of gene frequencies
are important, so that in most cases the population size can be
assumed  to  be  constant.  In  this  book  we  will  be  mostly  concerned  with  the 
genetic  change  of  a population  rather  than  the  change  of  population  size. 
The  genetic  change  of  a population  is  a slow  process,  so  that  short-term 
fluctuations  in  population  size  are  unimportant. 


4.3  Natural  selection  with  constant  fitness 

Adaptive change of a population occurs by substitution of more advantageous
genes for existing ones. The process of gene substitution is generally slow and
best described by the change of gene frequency in the population. The advantageousness
of a gene depends on whether the gene increases the fitness of
the genotype that carries the gene in heterozygous or homozygous condition.
Fitness is measured in terms of the number of offspring an individual
produces. Since the size of a natural population is more or less constant under
ordinary circumstances, it is often convenient to measure fitness in terms of
the relative number of offspring among different genotypes.

In  the  classical  theory  of  natural  selection  as  developed  by  Haldane 
(1924a,  b and  1926a,  b),  Fisher  (1930),  and  Wright  (1931),  it  is  customary  to 
assign  a constant  value  of  relative  fitness  for  each  genotype  irrespective  of 
population size. Namely, in this theory population size increases or decreases
geometrically and no regulation of population size is taken into account
(section  4.4).  Nevertheless,  this  simple  theory  is  useful  for  getting  a rough 
idea  about  how  the  genetic  structure  of  population  changes  by  natural 
selection.  In  the  following  we  consider  the  basic  principles  of  this  theory. 

There  are  two  kinds  of  models:  the  continuous  time  model  and  the  discrete 
generation  model.  In  the  continuous  time  model  the  fitness  of  a genotype 
is  expressed  by  the  Malthusian  parameter,  while  in  the  discrete  time  model 
it is measured by the Wrightian fitness. When generations overlap,
the  former  is  more  realistic.  However,  if  the  age  distribution  of  the  members 
of  the  population  remains  constant,  the  gene  frequency  change  can  be 
described  approximately  by  the  discrete  generation  model  (Haldane,  1926b; 
Charlesworth,  1970).  We  shall  consider  only  the  discrete  time  model  in  this 
book.  The  reader  who  is  interested  in  the  continuous  time  model  may  refer 
to  Crow  and  Kimura's  (1970)  book. 

4.3.1  Selection  with  a single  locus 

Consider a pair of alleles, $A_1$ and $A_2$, at a locus in a randomly mating diploid
population. We assume that generations are discrete. Let $x_1$ and
$x_2$ ($= 1 - x_1$) be the relative frequencies of genes $A_1$ and $A_2$ in a generation,
respectively, and designate the fitnesses of the three possible genotypes
$A_1A_1$, $A_1A_2$, and $A_2A_2$ by $W_{11}$, $W_{12}$, and $W_{22}$, respectively. Under random
mating, the frequencies of the three genotypes before selection follow the
Hardy-Weinberg proportions and become as given in table 4.1.

Table 4.1

Frequencies and fitnesses of genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$ at a locus.

Genotype     $A_1A_1$    $A_1A_2$     $A_2A_2$
Frequency    $x_1^2$     $2x_1x_2$    $x_2^2$
Fitness      $W_{11}$    $W_{12}$     $W_{22}$

The gene frequency in the next generation is therefore given by

$$x_1' = \frac{x_1^2W_{11} + \tfrac{1}{2}\cdot 2x_1x_2W_{12}}{\bar W} = \frac{x_1(x_1W_{11} + x_2W_{12})}{\bar W}, \tag{4.9}$$

where $\bar W = x_1^2W_{11} + 2x_1x_2W_{12} + x_2^2W_{22}$ is the mean fitness of the
population. The amount of change in gene frequency per generation then
becomes

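$$\Delta x_1 = x_1' - x_1 = \frac{x_1(1 - x_1)[x_1(W_{11} - W_{12}) + (1 - x_1)(W_{12} - W_{22})]}{\bar W}. \tag{4.10}$$

This can also be written

$$\Delta x_1 = \frac{x_1(1 - x_1)}{2\bar W}\,\frac{d\bar W}{dx_1}, \tag{4.11}$$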


since $d\bar W/dx_1 = 2[x_1(W_{11} - W_{12}) + (1 - x_1)(W_{12} - W_{22})]$ (Wright,
1937). From (4.10) or (4.11), it is easy to see that $\Delta x_1$ depends on the
relative values of $W_{11}$, $W_{12}$, and $W_{22}$ and not on the absolute values.
Thus, we can write $W_{11} = 1$, $W_{12} = 1 - h$, and $W_{22} = 1 - s$, or
$W_{11} = 1 - s_1$, $W_{12} = 1$, and $W_{22} = 1 - s_2$. The quantities $h$, $s$,
etc., are called selection coefficients.

Let  us  consider  some  special  cases. 

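1) Semidominant gene ($W_{11} = 1$, $W_{12} = 1 - s/2$, $W_{22} = 1 - s$).

$$\Delta x_1 = sx_1x_2/(2\bar W). \tag{4.12}$$

2) Completely dominant gene ($W_{11} = W_{12} = 1$, $W_{22} = 1 - s$).

$$\Delta x_1 = sx_1x_2^2/\bar W. \tag{4.13}$$

3) Completely recessive gene ($W_{11} = 1$, $W_{12} = W_{22} = 1 - s$).

$$\Delta x_1 = sx_1^2x_2/\bar W. \tag{4.14}$$

4) Overdominant gene ($W_{11} = 1 - s_1$, $W_{12} = 1$, $W_{22} = 1 - s_2$).

$$\Delta x_1 = x_1x_2[s_2 - (s_1 + s_2)x_1]/\bar W. \tag{4.15}$$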

Formulae (4.10)-(4.15) are nonlinear difference equations, so that it is not
easy to solve for the gene frequency in an arbitrary generation, though it is
not impossible (see Haldane and Jayakar, 1963a). Of course, if a high-speed
computer is available, the gene frequency can easily be obtained by
recurrence formula (4.9), starting from a given initial value. Thus, the entire
process of gene frequency change can be studied. In (4.12)-(4.14) $\Delta x_1$ is
always positive as long as $s$ remains positive. Therefore, the frequency of
$A_1$ always increases until it is fixed in the population. On the other hand,
$\Delta x_1$ in (4.15) is positive if $x_1$ is less than $\hat x_1 = s_2/(s_1 + s_2)$ but
negative if $x_1$ is larger than $\hat x_1$. Therefore, the frequency of $A_1$ tends
to be $\hat x_1$, where $\Delta x_1 = 0$. We shall discuss this problem in more detail later.
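
The iteration of recurrence formula (4.9) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `next_freq` and `trajectory` and the parameter values are assumptions made here, not part of the original treatment.

```python
def next_freq(x1, w11, w12, w22):
    """One generation of recurrence (4.9): return x1' from x1 and the three fitnesses."""
    x2 = 1.0 - x1
    w_bar = x1 * x1 * w11 + 2.0 * x1 * x2 * w12 + x2 * x2 * w22  # mean fitness
    return x1 * (x1 * w11 + x2 * w12) / w_bar


def trajectory(x0, w11, w12, w22, generations):
    """Gene frequencies of A1 for generations 0..generations (inclusive)."""
    xs = [x0]
    for _ in range(generations):
        xs.append(next_freq(xs[-1], w11, w12, w22))
    return xs


# Semidominant gene, case 1): W11 = 1, W12 = 1 - s/2, W22 = 1 - s
s = 0.01
semidominant = trajectory(0.01, 1.0, 1.0 - s / 2, 1.0 - s, 2000)

# Overdominant gene, case 4), with illustrative s1 = 0.4 and s2 = 0.1
overdominant = trajectory(0.9, 0.6, 1.0, 0.9, 500)

print(semidominant[-1], overdominant[-1])
```

With these illustrative values the semidominant trajectory rises monotonically, while the overdominant one approaches $s_2/(s_1 + s_2) = 0.2$, in line with (4.15).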
If selection coefficients are small, $\bar W$ is close to 1 and $\Delta x_1$ is small.
In this case, formula (4.10) can be approximated by

$$\frac{dx}{dt} = x(1 - x)[x(W_{11} - W_{12}) + (1 - x)(W_{12} - W_{22})], \tag{4.16}$$

where $x = x_1$ and $t$ stands for time in generations. It is easy to solve the
above differential equation.

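For a semidominant gene, (4.16) becomes

$$\frac{dx}{dt} = \frac{1}{2}sx(1 - x), \tag{4.17}$$

or

$$\frac{dx}{x(1 - x)} = \frac{1}{2}s\,dt.$$

Integrating this equation, we have

$$t = \frac{2}{s}\log_e\frac{x_t(1 - x_0)}{x_0(1 - x_t)}, \tag{4.18}$$

or

$$x_t = \frac{x_0}{x_0 + (1 - x_0)e^{-st/2}}, \tag{4.19}$$

where $x_0$ is the initial frequency of $x$. Therefore, the gene frequency increases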
logistically  (compare  this  formula  with  (4.5)).  For  the  cases  of  dominant 
and  recessive  genes,  we  can  get  similar  formulae;  in  these  cases  it  is  more 
convenient  to  use  the  formulae  equivalent  to  (4.18)  rather  than  to  (4.19), 
as  given  in  Crow  and  Kimura’s  (1970)  book.  They  become  as  follows: 

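For a dominant gene,

$$t = \frac{1}{s}\left[\log_e\frac{x_t(1 - x_0)}{x_0(1 - x_t)} + \frac{1}{1 - x_t} - \frac{1}{1 - x_0}\right]. \tag{4.20}$$

For a recessive gene,

$$t = \frac{1}{s}\left[\log_e\frac{x_t(1 - x_0)}{x_0(1 - x_t)} - \frac{1}{x_t} + \frac{1}{x_0}\right]. \tag{4.21}$$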


These  formulae  are  useful  when  we  want  to  know  the  number  of  generations 
required  for  gene  frequency  to  change  from  a given  value  to  another. 
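
As a sketch of how formulae (4.18), (4.20), and (4.21) might be used in practice, the following Python functions return the approximate number of generations needed to go from frequency $x_0$ to $x_t$. The function names are hypothetical, and the results inherit the small-$s$ approximation behind (4.16).

```python
import math


def _log_term(x0, xt):
    """Common term log_e[x_t(1 - x_0) / (x_0(1 - x_t))]."""
    return math.log(xt * (1.0 - x0) / (x0 * (1.0 - xt)))


def generations_semidominant(x0, xt, s):
    """Formula (4.18)."""
    return (2.0 / s) * _log_term(x0, xt)


def generations_dominant(x0, xt, s):
    """Formula (4.20)."""
    return (_log_term(x0, xt) + 1.0 / (1.0 - xt) - 1.0 / (1.0 - x0)) / s


def generations_recessive(x0, xt, s):
    """Formula (4.21)."""
    return (_log_term(x0, xt) - 1.0 / xt + 1.0 / x0) / s


# Illustrative values: time to go from 0.01 to 0.5 with s = 0.01
for f in (generations_semidominant, generations_dominant, generations_recessive):
    print(f.__name__, round(f(0.01, 0.5, 0.01)))
```

The recessive case takes far longer than the dominant case over the same low-frequency range, because selection acts only on the rare homozygotes.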
Fig. 4.2. Patterns of gene frequency changes for dominant (solid line), semidominant
(broken line), and recessive (dotted line) genes under selection. The initial gene frequency
($x_0$) is 0.01 and the selection coefficient ($s$) is 0.01 in all cases.


In fig. 4.2 the patterns of gene frequency changes for dominant,
semidominant, and recessive genes are given, starting from $x_0 = 0.01$. In all
cases $s = 0.01$ is assumed. The frequency for semidominant genes increases
logistically and reaches 0.999 in about 2000 generations. The frequency of
dominant genes increases rapidly in early generations but the rate of increase
becomes very small in later generations. On the other hand, the gene
frequency of recessive genes increases very slowly when it is small but very
rapidly when it is large.

Although  the  above  theory  for  the  change  of  gene  frequency  has  been 
known  for  almost  fifty  years,  there  are  surprisingly  few  data  from  natural 
populations  to  support  it.  This  is  mainly  because  the  gene  frequency  change 
in a population is generally so slow that it is difficult for one person to
describe the whole process in a lifetime. Nevertheless, there are a large
number of laboratory experiments which support the theory. These
experiments were mostly conducted with recessive lethal genes in Drosophila
melanogaster, and the agreement between the theory and observations is
quite satisfactory (e.g. Wallace, 1968). On the other hand, the results with
nonlethal genes are less satisfactory and suggest that the real process of
natural  selection  is  generally  more  complicated  (Merrell,  1965).  One  such 
example  will  be  discussed  later. 

There  appear  to  be  several  reasons  for  the  discrepancy  between  the  theory 
and  observation  for  nonlethal  genes.  The  following  are  important.  1)  The 
assumption  of  random  mating  is  not  necessarily  fulfilled  in  real  populations. 
2)  Although  fitness  includes  fertility  as  a component,  the  detailed  aspects  of 
fertility  differences  between  genotypes  or  mating  types  are  not  taken  into 
account  in  the  above  theory  (Bodmer,  1965).  3)  The  above  theory  is  based  on 
discrete  time  models,  while  laboratory  populations  are  often  maintained 
with  overlapping  generations.  When  generations  are  overlapping,  the  above 
theory  is  applicable  only  when  the  age  distribution  of  the  members  of  the 
population is in a stable form. 4) Laboratory populations are sometimes so
small that random genetic drift obscures the deterministic change of gene
frequency.  5)  Linkage  and  gene  interaction  may  upset  the  theory,  as  will  be 
seen  in  the  following.  6)  The  assumption  of  constant  fitness  does  not  always 
hold. 

4.3.2  Selection  with  multiple  loci 

When  two  or  more  loci  are  considered  together,  the  genetic  structure  of  a 
population  cannot  be  described  by  gene  frequencies  alone.  This  is  because 
the  frequency  of  a chromosome  type  is  not  necessarily  the  product  of  the 
frequencies  of  the  genes  involved.  A more  fundamental  parameter  in  this 
case  is  apparently  chromosome  frequency  rather  than  gene  frequency. 

Let us consider two loci each with two alleles, $A_1$, $A_2$ and $B_1$, $B_2$. There
are four different types of chromosomes possible with these loci, i.e., $A_1B_1$,
$A_1B_2$, $A_2B_1$, and $A_2B_2$. Let $X_1$, $X_2$, $X_3$, and $X_4$ be the frequencies of
these chromosomes, respectively. The gene frequencies of $A_1$, $A_2$, $B_1$, and
$B_2$ are then given by $x_1 = X_1 + X_2$, $x_2 = X_3 + X_4$, $y_1 = X_1 + X_3$, and
$y_2 = X_2 + X_4$, respectively. The chromosome frequencies are not necessarily
given by the products of gene frequencies involved. Namely,

$$X_1 = x_1y_1 + D, \tag{4.22a}$$

$$X_2 = x_1y_2 - D, \tag{4.22b}$$

$$X_3 = x_2y_1 - D, \tag{4.22c}$$

$$X_4 = x_2y_2 + D, \tag{4.22d}$$

where $D = X_1X_4 - X_2X_3$ is called linkage disequilibrium. It is easy to
prove the above equations. For example,

$$x_1y_1 + D = (X_1 + X_2)(X_1 + X_3) + X_1X_4 - X_2X_3 = X_1(X_1 + X_2 + X_3 + X_4) = X_1.$$
When  D = 0 in  a population,  this  population  is  said  to  be  in  linkage 
equilibrium.  Only  in  this  case  can  the  chromosome  frequencies  be  expressed 
as  the  products  of  gene  frequencies. 
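
The relations (4.22a)-(4.22d) can be checked numerically. The Python sketch below uses hypothetical function names and an arbitrary set of chromosome frequencies: it computes the gene frequencies and $D$ from $X_1, \ldots, X_4$ and then rebuilds the chromosome frequencies from them.

```python
def gene_freqs_and_d(X1, X2, X3, X4):
    """Return (x1, x2, y1, y2, D) for chromosomes A1B1, A1B2, A2B1, A2B2."""
    x1, x2 = X1 + X2, X3 + X4   # frequencies of A1 and A2
    y1, y2 = X1 + X3, X2 + X4   # frequencies of B1 and B2
    D = X1 * X4 - X2 * X3       # linkage disequilibrium
    return x1, x2, y1, y2, D


def chromosome_freqs(x1, y1, D):
    """Rebuild (X1, X2, X3, X4) from gene frequencies and D using (4.22a)-(4.22d)."""
    x2, y2 = 1.0 - x1, 1.0 - y1
    return (x1 * y1 + D, x1 * y2 - D, x2 * y1 - D, x2 * y2 + D)


x1, x2, y1, y2, D = gene_freqs_and_d(0.4, 0.1, 0.2, 0.3)
print(x1, y1, D)                    # gene frequencies and D
print(chromosome_freqs(x1, y1, D))  # recovers the original chromosome frequencies
```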

With  two  loci  each  with  two  alleles,  there  are  nine  possible  genotypes. 
The  frequencies  of  these  genotypes  under  random  mating  can  be  obtained  by 
expanding $(X_1A_1B_1 + X_2A_1B_2 + X_3A_2B_1 + X_4A_2B_2)^2$. They are given
in  table  4.2,  together  with  genotype  fitnesses. 

Table 4.2

Frequencies and fitnesses of nine possible genotypes for two loci each with two alleles.

                          $A_1A_1$       $A_1A_2$                  $A_2A_2$
$B_1B_1$    Frequency     $X_1^2$        $2X_1X_3$                 $X_3^2$
            Fitness       $W_{11}$       $W_{13}$                  $W_{33}$
$B_1B_2$    Frequency     $2X_1X_2$      $2(X_1X_4 + X_2X_3)$      $2X_3X_4$
            Fitness       $W_{12}$       $W_{14} = W_{23}$         $W_{34}$
$B_2B_2$    Frequency     $X_2^2$        $2X_2X_4$                 $X_4^2$
            Fitness       $W_{22}$       $W_{24}$                  $W_{44}$

The double heterozygotes are composed of coupling ($A_1B_1/A_2B_2$) and repulsion
($A_1B_2/A_2B_1$) genotypes. The frequencies of $A_1B_1/A_2B_2$ and $A_1B_2/A_2B_1$ are
$2X_1X_4$ and $2X_2X_3$, respectively.


In  the  absence  of  selection  the  chromosome  frequencies  in  the  next 
generation  can  be  obtained  in  the  following  way.  We  first  note  that  there 
are two ways in which chromosome $A_1B_1$ in generation $t + 1$ is produced
from the genotypes in generation $t$. First, it may be derived from genotypes
$A_1B_1/{-}{-}$ without recombination, where the notation $-$ refers to an arbitrary
allele at the specified locus. The probability of this event is $1 - r$, where $r$
is the recombination value between the two loci. Second, the $A_1B_1$
chromosome may be a product of recombination in genotypes $A_1{-}/{-}B_1$. The
probability of this event is $r$. The frequency of genotypes $A_1{-}/{-}B_1$ is of
course $x_1y_1$. Since the gene frequencies in a large random mating population
remain constant in all generations, we have

$$X_1^{(t+1)} = (1 - r)X_1^{(t)} + rx_1y_1. \tag{4.23a}$$

Similarly,

$$X_2^{(t+1)} = (1 - r)X_2^{(t)} + rx_1y_2, \tag{4.23b}$$

$$X_3^{(t+1)} = (1 - r)X_3^{(t)} + rx_2y_1, \tag{4.23c}$$

$$X_4^{(t+1)} = (1 - r)X_4^{(t)} + rx_2y_2. \tag{4.23d}$$

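If we note that $x_1y_1 = (X_1 + X_2)(X_1 + X_3) = X_1(1 - X_4) + X_2X_3$,
$x_1y_2 = (X_1 + X_2)(X_2 + X_4) = X_2(1 - X_3) + X_1X_4$, etc., the above
expressions can also be written as

$$X_1^{(t+1)} = X_1^{(t)} - rD^{(t)}, \tag{4.24a}$$

$$X_2^{(t+1)} = X_2^{(t)} + rD^{(t)}, \tag{4.24b}$$

$$X_3^{(t+1)} = X_3^{(t)} + rD^{(t)}, \tag{4.24c}$$

$$X_4^{(t+1)} = X_4^{(t)} - rD^{(t)}, \tag{4.24d}$$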

where $D^{(t)}$ is $X_1^{(t)}X_4^{(t)} - X_2^{(t)}X_3^{(t)}$. From (4.22a), we have
$X_1^{(t)} = x_1y_1 + D^{(t)}$ and $X_1^{(t+1)} = x_1y_1 + D^{(t+1)}$. Putting these into
(4.24a), we have

$$D^{(t+1)} = (1 - r)D^{(t)},$$

or

$$D^{(t)} = (1 - r)^tD^{(0)}, \tag{4.25}$$

where $D^{(0)}$ is the initial value of linkage disequilibrium. Therefore, linkage
disequilibrium declines at a rate of $r$ per generation under random mating.
If $r$ is small, it will take some time for linkage disequilibrium to be close
to 0. Nevertheless, we would expect that in a single random mating
population in nature alleles at different loci are generally combined at random
unless the recombination value is very small or some sort of strong natural
selection operates. If, however, there is migration between different populations,
linkage  disequilibrium  may  be  temporarily  developed  even  between  neutral 
loci  (Cavalli-Sforza  and  Bodmer,  1971;  Nei  and  Li,  1973). 
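
The decay described by (4.25) can be illustrated by iterating (4.24a)-(4.24d) directly. The following Python sketch is illustrative only; the function names and the starting frequencies are assumptions made for the example.

```python
import math


def disequilibrium(X):
    """D = X1*X4 - X2*X3 for chromosome frequencies X = (X1, X2, X3, X4)."""
    return X[0] * X[3] - X[1] * X[2]


def recombine(X, r):
    """One generation of random mating without selection, formulae (4.24a)-(4.24d)."""
    rd = r * disequilibrium(X)
    return (X[0] - rd, X[1] + rd, X[2] + rd, X[3] - rd)


def generations_to_fraction(r, fraction):
    """Generations t at which (1 - r)^t equals the given fraction of D(0), from (4.25)."""
    return math.log(fraction) / math.log(1.0 - r)


X = (0.5, 0.0, 0.0, 0.5)  # only A1B1 and A2B2 present: D(0) = 0.25
r = 0.1
for _ in range(10):
    X = recombine(X, r)
print(disequilibrium(X), 0.25 * (1 - r) ** 10)  # the two values should agree
print(generations_to_fraction(0.01, 0.5))       # generations needed to halve D when r = 0.01
```

Gene frequencies stay constant throughout; only the association between the loci decays.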

Let  us  now  consider  the  effect  of  natural  selection.  It  is  not  difficult  to 
obtain  the  chromosome  frequencies  in  the  next  generation  from  table  4.2. 
The frequency of $A_1B_1$ is given by

$$X_1' = [X_1^2W_{11} + X_1X_2W_{12} + X_1X_3W_{13} + \{X_1X_4(1 - r) + rX_2X_3\}W_{14}]/\bar W$$

$$\phantom{X_1'} = [X_1W_1 - rW_{14}D]/\bar W, \tag{4.26a}$$

where $W_1 = X_1W_{11} + X_2W_{12} + X_3W_{13} + X_4W_{14}$, and

$$\bar W = X_1^2W_{11} + 2X_1X_2W_{12} + 2X_1X_3W_{13} + 2(X_1X_4 + X_2X_3)W_{14} + X_2^2W_{22} + 2X_2X_4W_{24} + X_3^2W_{33} + 2X_3X_4W_{34} + X_4^2W_{44}.$$

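Similarly, the frequencies of $A_1B_2$, $A_2B_1$, and $A_2B_2$ in the next generation
are given by

$$X_2' = [X_2W_2 + rW_{14}D]/\bar W, \tag{4.26b}$$

$$X_3' = [X_3W_3 + rW_{14}D]/\bar W, \tag{4.26c}$$

$$X_4' = [X_4W_4 - rW_{14}D]/\bar W, \tag{4.26d}$$

where

$$W_2 = X_1W_{21} + X_2W_{22} + X_3W_{23} + X_4W_{24},$$

$$W_3 = X_1W_{31} + X_2W_{32} + X_3W_{33} + X_4W_{34},$$

$$W_4 = X_1W_{41} + X_2W_{42} + X_3W_{43} + X_4W_{44}.$$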

The  amounts  of  changes  of  chromosome  frequencies  per  generation  are 
therefore  given  by 

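$$\Delta X_1 = X_1' - X_1 = [X_1(W_1 - \bar W) - rW_{14}D]/\bar W, \tag{4.27a}$$

$$\Delta X_2 = [X_2(W_2 - \bar W) + rW_{14}D]/\bar W, \tag{4.27b}$$

$$\Delta X_3 = [X_3(W_3 - \bar W) + rW_{14}D]/\bar W, \tag{4.27c}$$

$$\Delta X_4 = [X_4(W_4 - \bar W) - rW_{14}D]/\bar W. \tag{4.27d}$$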

These  formulae  are  due  to  Lewontin  and  Kojima  (1960),  but  the  equivalent 
formulae had been obtained earlier by Kimura (1956) using a continuous
time model.

The above expressions are simultaneous nonlinear difference equations,
and the general solutions are not available. However, if we use a computer,
the  chromosome  frequencies  after  an  arbitrary  number  of  generations  can 
easily  be  obtained  by  using  formulae  (4.26).  The  patterns  of  chromosome 
frequency  changes  by  natural  selection  vary  greatly  with  genotype  fitness, 
recombination  value,  and  initial  linkage  disequilibrium.  If  there  is  no  gene 
interaction  between  loci  and  the  initial  linkage  disequilibrium  is  0,  the 
chromosome  frequencies  are  approximately  given  by  the  products  of  gene 
frequencies,  and  the  gene  frequency  at  a locus  changes  independently  of  the 
gene  frequency  at  the  other  locus.  Namely,  the  linkage  disequilibrium  is 
approximately  0 even  if  gene  frequencies  are  changing. 
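
A numerical iteration of formulae (4.26a)-(4.26d) might look like the following Python sketch. The function name `two_locus_step`, the 0-based chromosome indexing (0 = $A_1B_1$, 1 = $A_1B_2$, 2 = $A_2B_1$, 3 = $A_2B_2$), and the fitness values are assumptions made for illustration.

```python
def two_locus_step(X, W, r):
    """One generation of selection and recombination, formulae (4.26a)-(4.26d).

    X : four chromosome frequencies (A1B1, A1B2, A2B1, A2B2)
    W : symmetric 4x4 fitness matrix, W[i][j] = fitness of genotype i/j,
        with W[0][3] == W[1][2] (the double heterozygotes)
    r : recombination value between the two loci
    """
    D = X[0] * X[3] - X[1] * X[2]
    # Marginal fitnesses W_i of the four chromosomes
    marginal = [sum(X[j] * W[i][j] for j in range(4)) for i in range(4)]
    w_bar = sum(X[i] * marginal[i] for i in range(4))  # mean fitness
    sign = (-1.0, 1.0, 1.0, -1.0)  # sign of the r*W14*D term in (4.26a)-(4.26d)
    return [(X[i] * marginal[i] + sign[i] * r * W[0][3] * D) / w_bar for i in range(4)]


# Illustrative case: selection at the A locus only (fitness by number of A2 alleles)
w_a = {0: 1.0, 1: 0.95, 2: 0.9}
W = [[w_a[i // 2 + j // 2] for j in range(4)] for i in range(4)]

X = [0.1, 0.2, 0.3, 0.4]
for _ in range(100):
    X = two_locus_step(X, W, r=0.3)
print(X, sum(X))
```

When fitness depends on one locus only, the gene frequency at that locus follows the single-locus recurrence (4.9), which gives a convenient check on an implementation.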

The  departure  of  chromosome  frequencies  from  linkage  equilibrium  can 
be  measured  in  another  way.  Namely, 

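$$Z = \frac{X_1X_4}{X_2X_3}, \tag{4.28}$$

which is related to $D$ by

$$Z = 1 + \frac{D}{X_2X_3}. \tag{4.29}$$

The natural logarithm of $Z$,

$$\log_e Z = \log_e X_1 - \log_e X_2 - \log_e X_3 + \log_e X_4,$$

has the same sign as that of $D$. If $D$ is 0, $\log_e Z$ is also 0. If the amounts of
changes in chromosome frequencies per generation are small, we have

$$\Delta\log_e Z = \frac{\Delta Z}{Z} = \frac{\Delta X_1}{X_1} - \frac{\Delta X_2}{X_2} - \frac{\Delta X_3}{X_3} + \frac{\Delta X_4}{X_4} \tag{4.30}$$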


approximately.  (Mathematically,  the  above  formula  does  not  hold  when  the 
effect  of  the  second  and  higher  order  terms  of  chromosome  frequency 
changes  is  large.  In  practice,  however,  if  the  two  loci  are  loosely  linked  with 
weak  gene  interaction,  it  seems  to  be  a good  approximation  (Kimura,  1965).) 
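Substituting $\Delta X_i$ ($i = 1, \ldots, 4$) into the above expression, we have

$$\bar W\Delta\log_e Z = W_1 - W_2 - W_3 + W_4 - rW_{14}D\left(\frac{1}{X_1} + \frac{1}{X_2} + \frac{1}{X_3} + \frac{1}{X_4}\right)$$

$$\phantom{\bar W\Delta\log_e Z} = E - rW_{14}DX, \tag{4.31}$$

where $E = W_1 - W_2 - W_3 + W_4$ and $X = \sum_{i=1}^{4} 1/X_i$. Since $W_i$ is the
average fitness of the $i$-th chromosome, $E$ measures the effect of gene
interaction or epistasis on fitness. If $E = 0$, there is no epistasis.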

In the case of $E = 0$, (4.31) reduces to

$$\bar W\Delta\log_e Z = -rW_{14}DX. \tag{4.32}$$

Since $\bar W$, $W_{14}$, and $X$ are all positive and $Z = 1 + D/(X_2X_3)$, $\log_e Z$ and $D$
decrease if $D$ is positive but increase if $D$ is negative, unless $r$ is 0. Therefore,
$D$ eventually becomes 0. Namely, if there is no epistasis, the linkage
disequilibrium becomes 0.

If there is epistasis, the change in $\log_e Z$ is determined by $E - rW_{14}DX$.
If $E > 0$, $\log_e Z$ and $D$ will increase whenever $D$ is negative or zero. If $E < 0$,
they will decrease whenever $D$ is positive or zero. Thus, $D$ tends to have the
same sign as $E$ (Felsenstein, 1965). Note, however, that $E$ is not constant
when chromosome frequencies are changing in the presence of epistasis.
Kimura (1965) showed that if $r$ is larger than $|E|$, $Z$ rapidly tends toward a
value which is relatively stable even if gene frequencies are changing. He
called  this  state  quasi-linkage  equilibrium.  For  the  properties  of  this  quantity, 
see  Kimura  (1965),  Feldman  and  Crow  (1970),  and  Nagylaki  (1974). 

From  the  above  discussion,  it  is  clear  that  in  a large  random  mating 
population  linkage  disequilibrium  is  created  only  by  epistatic  selection, 
neglecting  the  small  disequilibrium  produced  by  the  second  order  effect  of 
gene  frequency  changes  (Nei,  1963). 

An  important  aspect  of  linkage  disequilibrium  is  that  the  gene  frequency 
change  at  a locus  may  be  affected  by  selection  at  a second  locus  which  is 
closely  linked  with  the  locus  under  study.  In  general  it  is  not  known  what 
kind of selection is operating at closely linked loci. If there is linkage
disequilibrium between two loci and one of these is subject to natural selection,
the  gene  frequency  at  the  other  locus  may  change  even  if  there  is  no  selection 
at  all  at  this  locus.  This  would  happen  particularly  in  laboratory  experiments 
in  which  the  initial  chromosome  frequencies  are  artificially  set  up. 

One possible example is given in fig. 4.3, where the frequency change of
allele F at the esterase 6 locus in Drosophila melanogaster is compared with
a result of computer simulation. In this simulation the esterase locus is
assumed to be neutral but linked with a second locus which is subject to
overdominant selection. The recombination value between the two loci is
0.15. The esterase 6 locus has two alleles F and S, while the second locus is
assumed to have alleles B and b. The fitnesses of BB, Bb, and bb used are
0.6, 1, and 0.9, so that the equilibrium gene frequency of B is 0.2 (see
formula (4.57)). The initial frequencies of chromosomes FB, Fb, SB, and Sb
were 0.2, 0, 0, and 0.8, respectively, in one set (Cage 17) and 0.8, 0, 0, and 0.2
in the other (Cage 18). In the former case the frequency ($y$) of allele B was 0.2
from the beginning, so that there was no change. Consequently, the frequency
($x$) of allele F also did not change at all. In the latter case, however, the B gene
frequency gradually declined with increasing generation, and the frequency
of the F allele followed the change of the B gene frequency in early
generations because of linkage, even if this locus was subjected to no selection. It
is clear that in both cases the frequency change of the F allele is close to the
experimental result. It should be noted, however, that this is not the only
result of computer simulation which closely mimics the experimental data.
Similar results may be obtained by changing the initial chromosome
frequencies and the recombination value, and also by adding some more loci.
In fact, if we consider a number of linked loci, a similar result may be
obtained without the aid of any overdominant loci. This sort of linkage
effect always makes it difficult to interpret experimental data properly.

Fig. 4.3. Frequency changes of the F allele at the esterase 6 locus in two cage populations
of Drosophila melanogaster studied by MacIntyre and Wright (1966) and the results of a
computer simulation (broken lines). In this computer simulation the esterase 6 locus was
assumed to be neutral but linked with an overdominant locus (B locus). $x$ is the frequency
of the F allele, while $y$ is the frequency of an allele at the B locus.
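
A simulation of the kind described above can be sketched in Python by specializing formulae (4.26a)-(4.26d) to a neutral locus linked to an overdominant one. The parameter values are those quoted in the text; the function names and the code itself are only an illustrative reconstruction, not the program used for fig. 4.3.

```python
def step(X, w_BB, w_Bb, w_bb, r):
    """One generation for chromosomes (FB, Fb, SB, Sb); the F/S locus is neutral."""
    FB, Fb, SB, Sb = X
    y = FB + SB                            # frequency of allele B
    w_B = y * w_BB + (1.0 - y) * w_Bb      # marginal fitness of a B-bearing chromosome
    w_b = y * w_Bb + (1.0 - y) * w_bb      # marginal fitness of a b-bearing chromosome
    w_bar = y * w_B + (1.0 - y) * w_b      # mean fitness
    D = FB * Sb - Fb * SB
    rec = r * w_Bb * D                     # double heterozygotes are Bb, so W14 = w_Bb
    return ((FB * w_B - rec) / w_bar, (Fb * w_b + rec) / w_bar,
            (SB * w_B + rec) / w_bar, (Sb * w_b - rec) / w_bar)


def simulate(X0, generations, w_BB=0.6, w_Bb=1.0, w_bb=0.9, r=0.15):
    """Return a list of (x, y) pairs: frequencies of F and of B in each generation."""
    X = tuple(X0)
    out = [(X[0] + X[1], X[0] + X[2])]
    for _ in range(generations):
        X = step(X, w_BB, w_Bb, w_bb, r)
        out.append((X[0] + X[1], X[0] + X[2]))
    return out


cage17 = simulate((0.2, 0.0, 0.0, 0.8), 30)
cage18 = simulate((0.8, 0.0, 0.0, 0.2), 30)
print(cage17[-1], cage18[-1])
```

In this sketch the per-generation change of the neutral allele is $\Delta x = D(W_B - W_b)/\bar W$, where $W_B$ and $W_b$ denote the marginal fitnesses computed in `step`, so $x$ moves only while both $D$ and the fitness difference at the B locus are nonzero.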
4.4 Competitive selection

So  far  we  have  assumed  that  genotype  fitness  is  constant  throughout  the 
process  of  gene  substitution.  The  assumption  of  constant  fitness  is,  however, 
equivalent to assuming that population size increases or decreases
geometrically (Feller, 1967; Moran, 1970). Suppose that the absolute fitnesses of
$A_1A_1$, $A_1A_2$, and $A_2A_2$ are given by 1, $1 - s/2$, and $1 - s$. Then, the rate of
population growth is given by $\bar W - 1 = -sx_2$ from (4.6). Therefore, if
$s > 0$, population size always decreases until $x_2$ becomes 0, while if $s < 0$,
it  always  increases.  Namely,  population  size  is  directly  affected  by  the  gene 
under  selection.  In  practice,  however,  population  size  is  generally  controlled 
by  outside  factors.  It  may  be  determined  by  the  total  amount  of  resource 
and  space  available,  irrespective  of  whether  selection  occurs  or  not.  This 
suggests  that  a large  part  of  natural  selection  occurs  by  competition  for 
limited  resources.  The  viability  of  a genotype  would  be  low  when  it  competes 
with  a strong  competitor  but  high  when  it  competes  with  a weak  competitor. 
In  this  case  the  fitness  of  a genotype  will  no  longer  be  constant. 

In  recent  years  several  authors  (e.g.,  Wright,  1969;  Schutz  and  Usanis, 
1969;  Anderson,  1971;  and  Clarke,  1972)  developed  mathematical  models 
for  this  type  of  selection.  In  these  models  genotype  fitnesses  are  expressed 
in  terms  of  genotype  frequencies  and  population  density.  In  most  of  the 
models,  however,  genotype  fitnesses  are  not  derived  as  a logical  consequence 
of  basic  processes  of  natural  selection  but  simply  given  as  a plausible  model. 
An  exception  is  that  of  Mather  (1969),  who  derived  genotype  fitnesses  as  a 
consequence  of  competitive  selection.  In  the  following  I shall  discuss  an 
extension  of  this  model  by  Nei  (1971b),  who  took  into  account  the  regulation 
of  population  size.  Although  this  model  is  simple  and  surely  unrealistic  in 
some  respects,  it  gives  an  insight  into  the  process  of  natural  selection  when 
population  size  remains  constant. 

We  assume  that  population  size  is  controlled  by  two  factors,  i.e.,  'intrinsic 
rate  of  reproduction'  and  'competition'.  It  is  known  that  there  is  little 
correlation  between  the  competitive  ability  and  intrinsic  rate  of  growth  or 
reproduction  (Lewontin,  1955;  Lewontin  and  Matsuo,  1963).  Competition 
may  occur  through  limitations  of  resources  and  space,  the  latter  including 
protective  shelters  against  predation  or  weather  factors  such  as  temperature 
and humidity.

4.4.1 Haploid model

Consider a haploid population in which two genotypes, $A_1$ and $A_2$, with
respect to a locus, are present. Let $n_1$ and $n_2$ be the numbers of adult
individuals for genotypes $A_1$ and $A_2$, respectively, with $N = n_1 + n_2$. The
relative frequencies are then $x_1 = n_1/N$ and $x_2 = n_2/N$. In the presence of
unlimited resources and space, there will occur no competition, so that the
increase of the number of each genotype will be determined by its intrinsic
rate of reproduction. In this case, the numbers of adult individuals for $A_1$
and $A_2$ in the next generation are

$$n_1' = n_1r_1, \tag{4.33a}$$

$$n_2' = n_2r_2, \tag{4.33b}$$

respectively. Here $r_1$ and $r_2$ are the intrinsic reproductive values of $A_1$ and
$A_2$, respectively. The intrinsic reproductive values are constants determined
by environmental (physical) conditions and can be written as $r_i = k_iv_i$, where
the $k_i$ and $v_i$ are fertility and viability, respectively. In the following we assume
for simplicity that $k_1 = k_2 = k$, and selection occurs through viability,
except for a special case.

In  nature,  however,  resources  and  space  are  limited,  and  competition  may 
occur  between  individuals  for  limited  resources  and  space.  Suppose 
that  two  or  more  individuals  compete  for  a unit  of  food  or  some  other 
resource  (including  space),  and  one  of  them  succeeds  in  getting  it.  The 
number  of  individuals  succeeding  in  a population  will  then  depend  on  the 
number  of  such  units  of  resource  present.  Thus,  if  the  level  of  resource 
present  is  small  compared  with  the  level  required  by  the  competing  individuals 
and  remains  the  same  for  all  generations,  the  population  size  as  measured  by 
adult  individuals  will  reach  the  saturation  level  and  thereafter  remain 
practically  constant.  We  consider  competition  at  the  saturation  level,  where 
kN  offspring  are  produced  in  each  generation  and  N individuals  survive  to 
the adult stage. Namely, the average survival rate is $1/k$. Competition may
occur between individuals of the same genotype as well as of different
genotypes. Since we have assumed no fertility difference between genotypes,
competition will occur between $A_1$ and $A_1$ with frequency $x_1^2$
($x_1 = kn_1/(kN) = n_1/N$), between $A_1$ and $A_2$ with frequency $2x_1x_2$, and
between $A_2$ and $A_2$ with frequency $x_2^2$.

Suppose that $A_1$ has a higher competitive ability than $A_2$, and when they
compete, $A_1$ wins with probability $(1 + s)/2$, while $A_2$ wins with probability
$(1 - s)/2$. When competition occurs between two individuals of the same
genotype, one of them wins with probability 1/2. The probability that either
of the two individuals wins is, of course, one. Therefore, we obtain the
probability of success of a genotype in each competitive event as given in
table 4.3.

Table 4.3

Frequencies of competition occurring between the same and different genotypes and
probabilities of success of the two genotypes in the haploid model.

Competition         Frequency       Probability of success
between                             $A_1$            $A_2$
$A_1$ : $A_1$       $x_1^2$         1                -
$A_1$ : $A_2$       $2x_1x_2$       $(1 + s)/2$      $(1 - s)/2$
$A_2$ : $A_2$       $x_2^2$         -                1

Competition may occur once or many times during the life of an
organism. If we assume that the fitness of an individual is proportional to the
probability of success in competition, then the numbers of adult individuals
in the next generation under purely competitive selection are given by

$$n_1' = n_1(1 + sx_2), \tag{4.34a}$$

$$n_2' = n_2(1 - sx_1). \tag{4.34b}$$

In  the  derivation  of  the  above  formulae,  we  used  pairwise  competition.  It 
can  be  shown,  however,  that  the  same  formulae  hold  irrespective  of  the 
number  of  individuals  competing  for  a unit  of  resource,  if  each  individual 
behaves  independently.  Furthermore,  the  same  formulae  are  applicable, 
even  if  there  are  several  different  niches  in  the  habitat  of  a population  (Nei, 
1971b). 
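
The step from table 4.3 to formulae (4.34a) and (4.34b) can be reproduced by summing the success probabilities over the three kinds of encounters. The Python sketch below uses hypothetical function names and illustrative numbers.

```python
def competitive_success(x1, s):
    """Total probabilities of success for A1 and A2 over all encounters in table 4.3."""
    x2 = 1.0 - x1
    a1 = x1 * x1 * 1.0 + 2.0 * x1 * x2 * (1.0 + s) / 2.0  # A1:A1 and A1:A2 encounters
    a2 = 2.0 * x1 * x2 * (1.0 - s) / 2.0 + x2 * x2 * 1.0  # A1:A2 and A2:A2 encounters
    return a1, a2


def next_numbers(n1, n2, s):
    """Adult numbers in the next generation under purely competitive selection."""
    N = n1 + n2
    a1, a2 = competitive_success(n1 / N, s)
    return N * a1, N * a2  # N adults survive in total


n1, n2, s = 300.0, 700.0, 0.2
print(next_numbers(n1, n2, s))
print(n1 * (1 + s * 0.7), n2 * (1 - s * 0.3))  # formulae (4.34a) and (4.34b)
```

The two printed pairs should agree, and the total number of adults stays at $N$, as expected at the saturation level.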

Let  us  now  consider  the  intermediate  stage  between  the  geometric  growth 
of  a population  and  the  saturation  level  in  which  only  competitive  selection 
occurs.  If  population  size  reaches  a certain  level,  the  growth  rate  gradually 
declines.  The  general  pattern  of  population  growth  seems  to  be  logistic. 
In  the  present  context  this  suggests  that  competition  occurs  even  if  the 
population  size  is  below  the  saturation  level  and  some  amount  of  resource 
remains  unutilized.  Perhaps  an  unequal  distribution  of  resource  among 
individuals  causes  some  of  them  to  compete  with  each  other  even  if  unutilized 
resource  remains  in  some  other  locations  of  the  habitat. 

Suppose that competitive selection occurs with a relative frequency of $c$
and noncompetitive selection occurs with a frequency of $1 - c$ in a generation.
Then, we have

$$n_1' = n_1[(1 - c)r_1 + c(1 + sx_2)], \tag{4.35a}$$

$$n_2' = n_2[(1 - c)r_2 + c(1 - sx_1)], \tag{4.35b}$$

where $c$ is a function of $n_1$ and $n_2$. The simplest form of $c$ would be $N/K$,
which is identical with the Verhulst-Pearl factor in the logistic equation.
In this case $K$ represents the population size at saturation. If $N = K$, gene
substitution occurs only through competitive selection. If the population
size increases exponentially until the saturation level is reached, then $c = 0$
for $N < K$ and $c = 1$ for $N = K$. In this formulation $c$ cannot be larger
than  1.  This  is  because  K is  the  maximum  number  of  individuals  that  can 
be  sustained  by  the  environment.  If  population  size  is  larger  than  K in  a 
generation,  it  is  immediately  adjusted  to  K in  the  next  generation. 

The Wrightian fitnesses of genotypes $A_1$ and $A_2$ are obtained by
$W_1 = n_1'/n_1$ and $W_2 = n_2'/n_2$, respectively. Namely,

$$W_1 = (1 - c)r_1 + c(1 + sx_2), \tag{4.36a}$$

$$W_2 = (1 - c)r_2 + c(1 - sx_1). \tag{4.36b}$$

From these formulae, we can see that the fitness of a genotype under
competitive selection is necessarily dependent on the genotype frequency. It is
also  noted  that  for  a given  value  of  c the  relative  fitness  of  a genotype  is 
higher  when  its  frequency  is  low.  This  is  exactly  what  we  have  seen  for  the 
wild-type  genotype  at  the  black  locus  of  the  flour  beetle  (fig.  4.1).  Similar 
minority  effects  have  been  observed  by  Harding  et  al.  (1966),  Kojima  and 
Yarbrough  (1967),  and  others,  though  in  Kojima  and  Yarbrough's  case  the 
mechanism  involved  seems  to  be  different  from  ours. 

The increases in numbers of individuals per generation for the two
genotypes and the total population are given by

$$\Delta n_1 = n_1[a_1 - c(a_1 - sx_2)], \tag{4.37a}$$

$$\Delta n_2 = n_2[a_2 - c(a_2 + sx_1)], \tag{4.37b}$$

$$\Delta N = N\bar a(1 - c), \tag{4.37c}$$

where $a_1 = r_1 - 1$, $a_2 = r_2 - 1$, and $\bar a = x_1a_1 + x_2a_2$. Mathematically, we
have to assume $0 < \bar a < 1$ to avoid the divergence of population size (see
section 4.1).

The amount of change in gene frequency of $A_1$ per generation ($\Delta x_1$) can
be obtained from (4.37a). It becomes

$$\Delta x_1 = \frac{x_1x_2[(1 - c)(a_1 - a_2) + cs]}{1 + (1 - c)\bar a}. \tag{4.38}$$

This formula shows that in an unsaturated population $x_1$ does not necessarily
increase, if the sign of $a_1 - a_2$ is not the same as that of $s$. However, if
the population size reaches the saturation level, where $c = 1$, we have
$\Delta x_1 = sx_1x_2$.
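
The haploid model with $c = N/K$ can be iterated numerically from (4.35a) and (4.35b), and formula (4.38) can be checked against it. The Python sketch below is illustrative: the function names, the parameter values, and the cap of $c$ at 1 (a safeguard reflecting the statement that $c$ cannot exceed 1) are assumptions made here.

```python
def step(n1, n2, r1, r2, s, K):
    """One generation of formulae (4.35a) and (4.35b) with c = N/K (capped at 1)."""
    N = n1 + n2
    x1, x2 = n1 / N, n2 / N
    c = min(N / K, 1.0)
    w1 = (1.0 - c) * r1 + c * (1.0 + s * x2)  # Wrightian fitness (4.36a)
    w2 = (1.0 - c) * r2 + c * (1.0 - s * x1)  # Wrightian fitness (4.36b)
    return n1 * w1, n2 * w2


def delta_x1(x1, c, a1, a2, s):
    """Change in gene frequency per generation, formula (4.38)."""
    x2 = 1.0 - x1
    a_bar = x1 * a1 + x2 * a2
    return x1 * x2 * ((1.0 - c) * (a1 - a2) + c * s) / (1.0 + (1.0 - c) * a_bar)


n1, n2 = 10.0, 90.0
r1, r2, s, K = 1.5, 1.4, 0.1, 1000.0
for generation in range(60):
    n1, n2 = step(n1, n2, r1, r2, s, K)
print(n1 + n2, n1 / (n1 + n2))  # total size approaches K; x1 keeps rising

# With a1 < a2 but s > 0, x1 can decline while the population is far from saturation
print(delta_x1(0.5, 0.1, 0.2, 0.5, 0.1), delta_x1(0.5, 1.0, 0.2, 0.5, 0.1))
```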


4.4.2 Diploid model


Consider the three possible genotypes, $A_1A_1$, $A_1A_2$, and $A_2A_2$, for a pair of
alleles at a locus. Let $n_{11}$, $n_{12}$, and $n_{22}$ be the numbers of adult individuals
for $A_1A_1$, $A_1A_2$, and $A_2A_2$, respectively, with $n_{11} + n_{12} + n_{22} = N$. The
relative frequencies are, therefore, $X_{11} = n_{11}/N$, $X_{12} = n_{12}/N$, and
$X_{22} = n_{22}/N$. We again assume that selection occurs only through viability
and there are no genetic differences in fertility. We denote by $v_{11}$, $v_{12}$, and
$v_{22}$ the viabilities of $A_1A_1$, $A_1A_2$, and $A_2A_2$, respectively, in the presence
of unlimited resources and space, the fertility being $k$ for all genotypes.
Note that $X_{11}$, $X_{12}$, and $X_{22}$ do not necessarily follow the Hardy-Weinberg
proportions, but the genotype frequencies before selection do. In the presence
of unlimited resources and space, the numbers of individuals of $A_1A_1$,
$A_1A_2$, and $A_2A_2$ in the next generation will be given by

$$n_{11}' = Nx_1^2kv_{11}, \tag{4.39a}$$

$$n_{12}' = 2Nx_1x_2kv_{12}, \tag{4.39b}$$

$$n_{22}' = Nx_2^2kv_{22}, \tag{4.39c}$$

respectively, where $x_1 = X_{11} + X_{12}/2$ is the gene frequency of $A_1$ and
$x_2 = 1 - x_1$.

Table 4.4

Frequencies of competition occurring between the same and different genotypes and
probabilities of success of the three genotypes in the diploid model.

Competition                Frequency          Probability of success
between                                       $A_1A_1$         $A_1A_2$         $A_2A_2$
$A_1A_1$ : $A_1A_1$        $x_1^4$            1                -                -
$A_1A_1$ : $A_1A_2$        $4x_1^3x_2$        $(1 + s_1)/2$    $(1 - s_1)/2$    -
$A_1A_1$ : $A_2A_2$        $2x_1^2x_2^2$      $(1 + s_2)/2$    -                $(1 - s_2)/2$
$A_1A_2$ : $A_1A_2$        $4x_1^2x_2^2$      -                1                -
$A_1A_2$ : $A_2A_2$        $4x_1x_2^3$        -                $(1 + s_3)/2$    $(1 - s_3)/2$
$A_2A_2$ : $A_2A_2$        $x_2^4$            -                -                1

The numbers of the three genotypes under purely competitive selection
can be obtained from table 4.4, where the probabilities of success of the
three genotypes are given. They become

$$n'_{11} = N x_1^2(1 + 2x_1x_2s_1 + x_2^2s_2), \tag{4.40a}$$

$$n'_{12} = 2N x_1x_2(1 - x_1^2s_1 + x_2^2s_3), \tag{4.40b}$$

$$n'_{22} = N x_2^2(1 - x_1^2s_2 - 2x_1x_2s_3). \tag{4.40c}$$

Therefore, the genotype fitnesses of $A_1A_1$, $A_1A_2$, and $A_2A_2$ under purely
competitive selection are $W_{11} = 1 + 2x_1x_2s_1 + x_2^2s_2$,
$W_{12} = 1 - x_1^2s_1 + x_2^2s_3$, and $W_{22} = 1 - x_1^2s_2 - 2x_1x_2s_3$,
respectively, which are again frequency dependent.

The recurrence equations for the $n$'s when both competitive and noncompetitive
forms of selection operate are rather complicated. But the changes
in the numbers of genes $A_1$ and $A_2$ ($n_1 = 2Nx_1$ and $n_2 = 2Nx_2$, respectively)
and the total population size per generation can be written in the same form
as those for the haploid model. That is,

$$\Delta n_1 = n_1[a_1 - c(a_1 - \bar{s}x_2)], \tag{4.41a}$$

$$\Delta n_2 = n_2[a_2 - c(a_2 + \bar{s}x_1)], \tag{4.41b}$$

$$\Delta N = N\bar{a}(1 - c), \tag{4.41c}$$

where $a_1 = k(x_1v_{11} + x_2v_{12}) - 1$, $a_2 = k(x_1v_{12} + x_2v_{22}) - 1$,
$\bar{a} = x_1a_1 + x_2a_2$, and $\bar{s} = x_1^2s_1 + x_1x_2s_2 + x_2^2s_3$, respectively. Therefore, the
formula for the amount of change in gene frequency also takes the same form
as (4.38) with the parameters defined here. In this case, however, $a_1$, $a_2$, and
$\bar{s}$ are not constant but functions of gene frequencies. So the change in gene
frequency in unsaturated populations can be more complicated than that
for the haploid model.

In saturated populations $\Delta x_1$ can be written as

$$\Delta x_1 = x_1x_2(x_1^2s_1 + x_1x_2s_2 + x_2^2s_3). \tag{4.42}$$

In the case of genic selection $s_1 = s_2/2 = s_3 = s$. Therefore, $\Delta x_1 = sx_1x_2$,
which is essentially the same as the formula for constant fitness (4.12), if
$s$ is replaced by $s/2$. If $A_1$ is completely dominant over $A_2$, $s_1 = 0$ and
$s_2 = s_3 = s$, giving $\Delta x_1 = sx_1x_2^2$, which is again similar to (4.13). In the
case of overdominance, however, we get

$$\Delta x_1 = x_1x_2(-x_1^2s_1' + x_1x_2s_2 + x_2^2s_3), \tag{4.43}$$

where $s_1' = -s_1$. Therefore, only when $s_2 = -s_1' + s_3$ does $\Delta x_1$ become
similar to the formula for constant fitness (4.15). That is,
$\Delta x_1 = x_1x_2\{s_3 - (s_1' + s_3)x_1\}$.
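
The saturated diploid results can be checked numerically. The following Python sketch is our own illustration: the function names and parameter values are assumptions, not part of the original treatment. It computes the frequency-dependent fitnesses under purely competitive selection, applies them to Hardy-Weinberg frequencies, and compares the resulting change in $x_1$ with the closed form $x_1x_2(x_1^2s_1 + x_1x_2s_2 + x_2^2s_3)$.

```python
def competitive_fitnesses(x1, s1, s2, s3):
    """Fitnesses of A1A1, A1A2, A2A2 under purely competitive selection."""
    x2 = 1.0 - x1
    w11 = 1.0 + 2.0 * x1 * x2 * s1 + x2 ** 2 * s2
    w12 = 1.0 - x1 ** 2 * s1 + x2 ** 2 * s3
    w22 = 1.0 - x1 ** 2 * s2 - 2.0 * x1 * x2 * s3
    return w11, w12, w22


def delta_x1_from_fitnesses(x1, s1, s2, s3):
    """Change in x1 obtained by applying the fitnesses to H-W frequencies."""
    x2 = 1.0 - x1
    w11, w12, _ = competitive_fitnesses(x1, s1, s2, s3)
    # The mean fitness equals 1 in this model, so no normalization is needed.
    x1_next = x1 ** 2 * w11 + x1 * x2 * w12
    return x1_next - x1


def delta_x1_closed_form(x1, s1, s2, s3):
    """Closed-form change in x1 for a saturated population."""
    x2 = 1.0 - x1
    return x1 * x2 * (x1 ** 2 * s1 + x1 * x2 * s2 + x2 ** 2 * s3)


x1, s = 0.3, 0.02
# Genic selection: s1 = s2/2 = s3 = s.
print(delta_x1_from_fitnesses(x1, s, 2 * s, s), s * x1 * (1 - x1))
# Complete dominance of A1: s1 = 0, s2 = s3 = s.
print(delta_x1_from_fitnesses(x1, 0.0, s, s), s * x1 * (1 - x1) ** 2)
```

Each printed pair should agree up to rounding, which is a quick way to confirm the genic and dominant special cases.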

4.4.3  Selection  with  multiple  loci 

So far we have studied the gene frequency change at a single locus in regulated
populations, neglecting all alleles at other loci. In natural populations, however,
there are many loci at which alleles are segregating, and population
growth below saturation level would generally be controlled by more than
one locus except in some special cases. The mathematical formulation of
population growth in such cases is very complicated. Fortunately, most
natural populations are more or less constant and their size at equilibrium
appears to be controlled mainly by outside factors rather than the genes
under selection. Thus, the process of natural selection in regulated populations
may be approximated by the model of competitive selection at saturation
level discussed above.

In extending the single locus theory to multiple loci, however, some
caution is required. Two different loci, A and B, may control two entirely
different competitive events or the same event. In the former case, the two
genes are clearly independent in function. Thus, the fitness of a genotype,
say, $A_1B_1$ in haploids, may be given by $(1 + s_Ax_2)(1 + s_By_2)$, where subscripts
A and B refer to loci A and B, respectively, and $y_2$ stands for the
frequency of allele $B_2$ at the B locus. Namely, the fitness of a genotype may
be given by the product of the fitnesses for the component genotypes at each
locus. Therefore, the gene frequency change at one locus is not affected by
that of the other, as long as there is linkage equilibrium.

On the other hand, if the two loci affect the same competitive event, we
must consider competition between all possible pairs of genotypes. If there
are $r$ genotypes, the number of possible genotype combinations is $r(r - 1)/2$,
and the number of parameters to be specified for describing all competitive
events rapidly increases with $r$. Therefore, there are a large number of ways
in which competitive selection may occur. This suggests that the actual
process of competitive selection in nature may be extremely complicated if
there are a number of loci affecting the same competitive event. In practice,
however, the complete specification of all the parameters is virtually impossible,
and to make the mathematical treatment manageable certain
simplifying assumptions must be made. If the gene actions at different loci
are independent, a relatively small number of parameters are required, and
rather simple formulae for the changes of genotype frequencies may be
obtained.

Table 4.5

Competitive selection when two loci are involved in the haploid model.

Competition between    Frequency     Probability of success
                                     A1B1            A1B2             A2B1             A2B2
$A_1B_1 : A_1B_1$      $X_1^2$       $1$             -                -                -
$A_1B_1 : A_1B_2$      $2X_1X_2$     $(1 + s_B)/2$   $(1 - s_B)/2$    -                -
$A_1B_1 : A_2B_1$      $2X_1X_3$     $(1 + s_A)/2$   -                $(1 - s_A)/2$    -
$A_1B_1 : A_2B_2$      $2X_1X_4$     $(1 + t_1)/2$   -                -                $(1 - t_1)/2$
$A_1B_2 : A_1B_2$      $X_2^2$       -               $1$              -                -
$A_1B_2 : A_2B_1$      $2X_2X_3$     -               $(1 + t_2)/2$    $(1 - t_2)/2$    -
$A_1B_2 : A_2B_2$      $2X_2X_4$     -               $(1 + s_A')/2$   -                $(1 - s_A')/2$
$A_2B_1 : A_2B_1$      $X_3^2$       -               -                $1$              -
$A_2B_1 : A_2B_2$      $2X_3X_4$     -               -                $(1 + s_B')/2$   $(1 - s_B')/2$
$A_2B_2 : A_2B_2$      $X_4^2$       -               -                -                $1$

To see this point, let us consider a haploid population in which alleles
$A_1$, $A_2$ and $B_1$, $B_2$ are segregating at loci A and B, respectively. We have four
genotypes $A_1B_1$, $A_1B_2$, $A_2B_1$, and $A_2B_2$. Let $X_1$, $X_2$, $X_3$, and $X_4$ be the
frequencies of genotypes $A_1B_1$, $A_1B_2$, $A_2B_1$, and $A_2B_2$ before selection,
respectively. A complete specification of competitive selections is given in
table 4.5. In the present case there are four genotypes, so that six competition
parameters are required. The genotype frequencies after selection ($X_{1a}$,
$X_{2a}$, etc. for $A_1B_1$, $A_1B_2$, etc.) are then given by

$$X_{1a} = X_1^2 + X_1X_2(1 + s_B) + X_1X_3(1 + s_A) + X_1X_4(1 + t_1)$$

$$\phantom{X_{1a}} = X_1(1 + X_2s_B + X_3s_A + X_4t_1),$$

$$X_{2a} = X_2(1 - X_1s_B + X_3t_2 + X_4s_A'),$$

$$X_{3a} = X_3(1 - X_1s_A - X_2t_2 + X_4s_B'),$$

$$X_{4a} = X_4(1 - X_1t_1 - X_2s_A' - X_3s_B').$$

In haploid organisms mating occurs between adult individuals and immediately
after mating meiosis occurs. Thus, the genotype frequencies in the
next generation are given by

$$X_1' = X_1W_1 - rD, \tag{4.44a}$$

$$X_2' = X_2W_2 + rD, \tag{4.44b}$$

$$X_3' = X_3W_3 + rD, \tag{4.44c}$$

$$X_4' = X_4W_4 - rD, \tag{4.44d}$$

where $W_1$, $W_2$, $W_3$, and $W_4$ are the fitnesses of $A_1B_1$, $A_1B_2$, $A_2B_1$, and
$A_2B_2$, respectively, and given by

$$W_1 = 1 + X_2s_B + X_3s_A + X_4t_1, \tag{4.45a}$$

$$W_2 = 1 - X_1s_B + X_3t_2 + X_4s_A', \tag{4.45b}$$

$$W_3 = 1 - X_1s_A - X_2t_2 + X_4s_B', \tag{4.45c}$$

$$W_4 = 1 - X_1t_1 - X_2s_A' - X_3s_B'. \tag{4.45d}$$

On the other hand, $D$ is the linkage disequilibrium after selection and given
by $X_{1a}X_{4a} - X_{2a}X_{3a}$. It is noted that the genotype fitnesses are again
frequency dependent. The amounts of changes of genotype frequencies per
generation are then given by

$$\Delta X_1 = X_1(W_1 - 1) - rD, \tag{4.46a}$$

$$\Delta X_2 = X_2(W_2 - 1) + rD, \tag{4.46b}$$

$$\Delta X_3 = X_3(W_3 - 1) + rD, \tag{4.46c}$$

$$\Delta X_4 = X_4(W_4 - 1) - rD. \tag{4.46d}$$

Although the mathematical forms of the above formulae are simple, they
depend on the six competition parameters given in table 4.5. In many cases
we may assume that $s_A = s_A'$, $s_B = s_B'$, $t_1 = s_A + s_B + \varepsilon_1$, and
$t_2 = s_A - s_B - \varepsilon_2$, where $\varepsilon_1$ and $\varepsilon_2$ are epistatic interactions. If these are both 0, then
the gene actions at the two loci are independent. In this case genotype
fitnesses depend only on gene frequencies, i.e., $W_1 = 1 + x_2s_A + y_2s_B$,
$W_2 = 1 + x_2s_A - y_1s_B$, $W_3 = 1 - x_1s_A + y_2s_B$, and
$W_4 = 1 - x_1s_A - y_1s_B$.

As in the case of constant fitness, linkage disequilibrium is developed only
when there is epistasis. This can be seen by putting the equations (4.46) into
(4.30). Clearly,

$$\Delta \log_e Z = E - rDX,$$

where $X = \sum_{i=1}^{4} 1/X_i$ and
$E = X_1(s_A + s_B - t_1) - X_2(s_A' - s_B - t_2) + X_3(s_A - s_B' - t_2) - X_4(s_A' + s_B' - t_1)$.
Thus, if there is no epistasis, $E = 0$,
and $D$ eventually becomes 0, as discussed earlier.
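
The two-locus haploid recursion can be sketched in a few lines of Python. This is a minimal illustration under our own naming conventions: the function name, argument order, and numerical values are assumptions made for the example. It follows the fitnesses, the post-selection disequilibrium $D$, and the recombination step as given above, and also evaluates $E$.

```python
def two_locus_step(X, sA, sB, sA_p, sB_p, t1, t2, r):
    """One generation of competitive selection plus recombination (haploid).

    X          : frequencies (X1, X2, X3, X4) of A1B1, A1B2, A2B1, A2B2
    sA, sB     : competition parameters s_A and s_B
    sA_p, sB_p : the primed parameters s_A' and s_B'
    t1, t2     : parameters for A1B1 vs A2B2 and A1B2 vs A2B1
    r          : recombination value
    """
    X1, X2, X3, X4 = X
    # Frequency-dependent fitnesses.
    W1 = 1 + X2 * sB + X3 * sA + X4 * t1
    W2 = 1 - X1 * sB + X3 * t2 + X4 * sA_p
    W3 = 1 - X1 * sA - X2 * t2 + X4 * sB_p
    W4 = 1 - X1 * t1 - X2 * sA_p - X3 * sB_p
    # Frequencies after selection and the disequilibrium computed from them.
    Xa = [X1 * W1, X2 * W2, X3 * W3, X4 * W4]
    D = Xa[0] * Xa[3] - Xa[1] * Xa[2]
    X_next = [Xa[0] - r * D, Xa[1] + r * D, Xa[2] + r * D, Xa[3] - r * D]
    # Epistasis term E.
    E = (X1 * (sA + sB - t1) - X2 * (sA_p - sB - t2)
         + X3 * (sA - sB_p - t2) - X4 * (sA_p + sB_p - t1))
    return X_next, (W1, W2, W3, W4), D, E


# No epistasis: sA' = sA, sB' = sB, t1 = sA + sB, t2 = sA - sB.
sA, sB = 0.10, 0.05
X_next, W, D, E = two_locus_step([0.4, 0.1, 0.2, 0.3], sA, sB, sA, sB,
                                 sA + sB, sA - sB, r=0.2)
print(X_next, sum(X_next))
print(E)
```

In this example the new frequencies still sum to 1 and $E$ vanishes up to rounding, as expected when the two epistatic terms are zero.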

From  the  above  discussion  we  can  see  that  competitive  and  noncompetitive 
selections  give  roughly  the  same  result  if  gene  action  is  simple.  In  diploid 
populations  competitive  selection  can  be  more  complicated  than  in  haploids, 
since  the  number  of  possible  genotypes  is  larger  and  a larger  number  of 
competition  parameters  are  required.  For  example,  in  the  case  of  two  loci 
each  with  two  alleles,  there  are  nine  possible  genotypes,  so  that  the  number 
of  parameters  for  complete  specification  of  competitive  selection  is  36. 
However,  this  number  can  be  reduced  considerably  if  we  make  certain 
simplifying  assumptions,  and  the  mathematical  treatment  becomes  similar 
to  that  of  constant  fitness. 
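
The parameter counts quoted for the haploid and diploid cases follow from the number of unordered pairs of distinct genotypes, $r(r - 1)/2$. A small Python check, with a helper name of our own choosing:

```python
from math import comb


def competition_parameters(num_genotypes):
    """Number of unordered pairs of distinct genotypes, r(r - 1)/2."""
    return comb(num_genotypes, 2)


haploid_genotypes = 2 ** 2  # two loci, two alleles each
diploid_genotypes = 3 ** 2  # three single-locus genotypes at each of two loci

print(competition_parameters(haploid_genotypes))  # 6
print(competition_parameters(diploid_genotypes))  # 36
```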

In practice, we generally do not know what kind of selection is operating
at a particular locus or loci. Furthermore, the models of competitive and
noncompetitive selections discussed in this chapter both deal with idealized
situations. Which model fits better to real situations is, of course, an empirical
question and has to be answered by data. It is, however, interesting
to see that as long as population size remains roughly constant, gene or
chromosome frequency change can be described by approximately the same
formula. For this reason, we shall use the simple model of constant fitness in
the following, whenever it is applicable. One important case in which the
distinction between the two models is meaningful is that of fertility excess
required for gene substitution.

4.5  Fertility  excess  required  for  gene  substitution

The  essential  process  of  adaptive  change  of  an  organism  in  evolution  is  the 
substitution  of  a more  advantageous  gene  for  a less  fit  gene.  Selective 
advantage  of  a gene  is  conferred  in  many  different  ways.  If  a gene  increases 
the  fertility  of  an  organism  compared  with  other  genes,  it  certainly  has  a 
selective  advantage,  since  the  gene  is  more  rapidly  multiplied  than  the  others. 
Other  things  being  equal,  a gene  which  induces  a shorter  generation  time  is 
also  expected  to  have  a selective  advantage,  since  the  rate  of  increase  of  gene 
number  per  unit  length  of  time  is  high.  In  the  actual  process  of  evolution, 
however,  those  genes  which  control  fertility  and  generation  time  appear 
to  have  played  little  role,  since  fertility  has  declined  from  lower  organisms 
to higher organisms and generation time has increased. Rather, the evolutionary
change in adaptability has occurred mainly through the increase in
viability.  For  example,  a female  fruitfly  is  able  to  produce  far  more  than  100 
offspring  but  the  majority  of  them  die  before  maturity,  while  the  female 
fertility  in  man  is  generally  less  than  10  but  the  majority  of  individuals  are 
able  to  live  up  to  maturity. 

Haldane  (1957a,  1960)  showed  that  the  number  of  genes  that  can  be 
substituted  simultaneously  in  a population  depends  on  the  fertility  of  the 
organism  in  question.  According  to  his  theory,  gene  substitution  is  initiated 
by  some  environmental  change,  which  makes  a prevalent  allele  in  the 
population  less  advantageous,  while  a mutant  allele  that  was  originally  less 
fit  becomes  advantageous  and  increases  in  frequency.  The  mutant  allele 
eventually  replaces  the  original  allele  and  becomes  fixed  in  the  population. 
In  the  process  of  gene  substitution  the  less  fit  gene  creates  a reduction  in 
fitness,  and  if  there  are  many  genes  under  substitution  in  the  same  population 
the  total  amount  of  reduction  in  fitness  is  so  large,  that  the  species  may  not 
be  able  to  survive  when  fertility  is  limited.  The  total  amount  of  reduction  in 
fitness in the process of gene substitution was called the cost of natural
selection. This concept was immediately accepted and extended by Kimura
(1961),  who  called  it  the  substitution  load. 

Haldane's theory was, however, criticized by a number of authors. Van
Valen (1963) and Brues (1969) commented that gene substitution is the
process of increase in population fitness and thus it must be beneficial and
should not create any cost to the population except in certain situations. This
comment is largely semantic and does not negate Haldane's computation,
though semantics is quite important in understanding the concept (Turner,
1972). On the other hand, Sved (1968a) and Maynard Smith (1968b) questioned
the assumption of independent gene substitutions at different loci.
Arguing that natural selection must be largely competitive since population
size remains more or less constant and the competitive ability of an individual
is controlled by a large number of loci, they developed a model of truncation
selection in which only the individuals whose competitive ability is higher
than a certain threshold can survive to adulthood. As I have discussed elsewhere
(Nei, 1971b), however, such a truncation selection is possible only
when competition occurs just once in life for a single limiting resource. By
the time at which competitive selection occurs, all the genes concerned must
have expressed their effects on a certain phenotypic character which determines
the competitive ability of each individual. This type of selection occurs
in artificial selection for quantitative characters, but it is questionable
whether it occurs in the process of natural selection. In nature, selection
operates at many different stages of life and for many different reasons. Therefore,
it seems to be reasonable to assume that competitions at different
developmental stages occur largely independently. Of course, there are some
clear exceptions to this (see Nei, 1971b).

As  mentioned  earlier,  Haldane  assumed  that  gene  substitution  is  triggered 
by  some  change  of  environment.  He  cites  as  an  example  the  replacement  of 
the  original  light  color  type  of  the  moth  Biston  betularia  by  a melanic  mutant 
type in industrial areas of England (Kettlewell, 1955). However, environmental
change is not the sole factor initiating gene substitution. If a new
advantageous  mutation  occurs  in  a population,  gene  substitution  may  occur 
without  change  of  environment.  If  the  selective  advantage  of  the  mutant  gene 
is due to a stronger competitive ability, the population size after gene substitution
would not be much different from that before substitution, as
discussed  earlier.  In  this  case  the  survival  of  a species  would  not  be  affected 
by  the  gene  substitution  unless  there  are  competitor  species  coexisting  in  the 
same  area.  Therefore,  there  are  two  types  of  gene  substitutions  which  can  be 
distinguished  in  terms  of  species  survival.  In  both  cases,  however,  the  number 
of  possible  gene  substitutions  per  unit  length  of  time  is  limited  by  the  fertility 
of  the  species  concerned. 

Let us now consider this problem in some detail by using diploid models
for genic selection. Dominance complicates the problem slightly but the
conclusion is essentially the same. We shall first consider competitive
selection in infinitely large populations. In section 4.3 we showed that the
fitnesses of genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$ in a saturated population are
$W_{11} = 1 + 2x_1x_2s_1 + x_2^2s_2$, $W_{12} = 1 - x_1^2s_1 + x_2^2s_3$, and
$W_{22} = 1 - x_1^2s_2 - 2x_1x_2s_3$, respectively. In the case of genic selection
$s_1 = s_2/2 = s_3 = s$, so that $W_{11} = 1 + 2x_2s$, $W_{12} = 1 - (x_1 - x_2)s$, and
$W_{22} = 1 - 2x_1s$, while the amount of change in gene frequency per generation is
$\Delta x_1 = sx_1x_2$ or $\Delta x = sx(1 - x)$, where $x = x_1$. For a gene substitution to proceed
at this rate, the fitness of genotype $A_1A_1$ must be $1 + 2s(1 - x)$ or higher.
Namely, the fertility of an individual ($k$) must be equal to or higher than
$1 + 2s(1 - x)$, neglecting the mortality due to environmental causes. If $k$
is smaller than $1 + 2s(1 - x)$, the rate of gene substitution is slowed down.
In other words, a fertility excess of $2s(1 - x)$ is required for the gene substitution
to proceed at a specified rate. The population size will not decrease
unless $k$ is smaller than unity, as argued by Kimura and Crow (1969) and
Crow (1970). Of course, in most organisms $k$ is much larger than
$1 + 2s(1 - x)$, of which the maximum is close to 3 when $s = 1$ and $x$ is close to 0. If,
however, more than one gene substitution occurs simultaneously in a
population, a fertility excess of more than $2s(1 - x)$ is required. The fertility
excess required for a specified number of gene substitutions per generation
to occur can be computed in the following way.

First, we compute the accumulated fertility excess required ($E$) for one
complete gene substitution. If we approximate $\Delta x$ by $dx/dt$, then
$dt = dx/\{sx(1 - x)\}$. Therefore, the accumulated fertility excess required is

$$E = \int_0^\infty 2s(1 - x)\,dt = -2\log_e x_0, \tag{4.47}$$

where $x_0$ is the initial gene frequency of $A_1$. Interestingly, this depends only
on the initial gene frequency and is independent of $s$. Suppose that gene
substitution takes place at many loci simultaneously in a population and it
takes $t_s$ generations on the average for a gene substitution to be completed.
At a particular locus, the fertility excess required for gene substitution in a
generation is then $E/t_s$ on the average. In other words, the average fertility
required is $1 + E/t_s$. If gene substitutions at different loci occur independently,
the fertility required for the joint substitution of $r$ loci is

$$(1 + E/t_s)^r \approx e^{rE/t_s}. \tag{4.48}$$

Therefore, if the average fertility of the species is $k$, the number of possible
gene substitutions per generation ($v$) is obtained from the relation $k = e^{vE}$,
where $v = r/t_s$. Namely,

$$v = \log_e k/(-2\log_e x_0). \tag{4.49}$$

In many cases $x_0$ seems to be at most 0.001, while in mammalian species the
average fertility is often less than 10. If $x_0 = 0.0001$ and $k = 10$, then the
maximum possible number of gene substitutions per generation is 0.11.
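
Formulae (4.47) and (4.49) are easy to evaluate for other values of the initial gene frequency and the fertility. The Python sketch below is only an illustration: the function names are ours, and natural logarithms are used throughout, as in the text.

```python
import math


def accumulated_fertility_excess(x0):
    """Accumulated fertility excess E for one substitution, E = -2 ln(x0)."""
    return -2.0 * math.log(x0)


def substitutions_per_generation(k, x0):
    """Number of substitutions per generation v that solves k = exp(v * E)."""
    return math.log(k) / accumulated_fertility_excess(x0)


for x0 in (1e-3, 1e-4, 1e-5):
    E = accumulated_fertility_excess(x0)
    v = substitutions_per_generation(10.0, x0)
    print(f"x0 = {x0:g}: E = {E:.2f}, v = {v:.3f}")
```

Because $E$ depends on $x_0$ only logarithmically, changing $x_0$ by a factor of ten changes $v$ only modestly, which lets the reader check the order of magnitude of the figures discussed above.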

Haldane's original computation of the cost of natural selection is based
on constant genotype fitness rather than frequency dependent fitness. Let
the fitnesses of genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$ be 1, $1 - s$, and $1 - 2s$,
respectively. Still using $x$ for the gene frequency of $A_1$, the mean fitness
is $\bar{W} = 1 - 2s(1 - x)$. Thus, the amount of reduction in fitness compared
with that of the population of $A_1A_1$ only is $2s(1 - x)$. The gene frequency
change per generation again can be approximated by $dx/dt = sx(1 - x)$
when $s$ is small. Therefore, the accumulated reduction in fitness is

$$C = \int_0^\infty 2s(1 - x)\,dt = -2\log_e x_0,$$

which is identical with (4.47). Haldane called this the cost of natural selection.
This cost becomes 19 if $x_0$ is 0.0001. Haldane, however, showed that it is
much larger for recessive genes and suggested that the representative cost
for one gene substitution is 30. He then argued that a species would devote
about 10 percent fertility excess to the process of gene substitution. Thus, a
species could carry out one gene substitution on the average every 300
generations.
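
The independence of the accumulated cost from $s$ can be checked by summing the per-generation reduction $2s(1 - x)$ along a discrete-generation trajectory. The Python sketch below is an illustration under our own assumptions: it replaces the differential equation by the recursion $x' = x + sx(1 - x)$, stops when $1 - x$ falls below a small tolerance, and uses a function name that is not from the text.

```python
import math


def accumulated_cost(s, x0, tol=1e-12, max_generations=10_000_000):
    """Sum 2s(1 - x) over generations while x changes by s*x*(1 - x)."""
    x = x0
    cost = 0.0
    for _ in range(max_generations):
        if 1.0 - x < tol:
            break  # substitution is effectively complete
        cost += 2.0 * s * (1.0 - x)
        x += s * x * (1.0 - x)
    return cost


x0 = 1e-4
print(-2.0 * math.log(x0))            # continuous-time value
for s in (0.01, 0.001):
    print(s, accumulated_cost(s, x0))  # discrete sums for two values of s
```

For small $s$ the discrete sums stay close to $-2\log_e x_0$ whichever value of $s$ is used; the small remaining gap comes from treating generations as discrete.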

It  is  clear  that  Haldane's  argument  about  the  cost  of  natural  selection  is 
essentially  the  same  as  the  case  of  competitive  selection  though  he  considered 
a slightly  different  situation.  For  a population  not  to  become  extinct  during 
the  process  of  gene  substitution,  there  must  be  a fertility  excess  to  offset 
the  cost.  This  cost  is  exactly  the  same  as  the  accumulated  fertility  excess 
required  in  the  case  of  competitive  selection.  The  only  difference  is  that  when 
there  is  not  enough  fertility  excess  the  population  becomes  extinct  in 
Haldane's  case  (Felsenstein,  1971),  while  in  the  case  of  competitive  selection 
the  population  never  becomes  extinct  unless  k is  less  than  unity  but  simply 
the  rate  of  gene  substitution  is  reduced.  In  practice,  of  course,  it  is  not  always 
easy  to  distinguish  between  the  two  types  of  selection.  Even  the  industrial 
melanism  mentioned  earlier  can  be  argued  to  have  occurred  by  competitive 
selection  against  predators. 

So far we have assumed that the population size is infinitely large, but all
natural populations are actually finite. The substitutional load or the fertility
excess required in finite populations has been studied by Kimura and
Maruyama (1969), Kimura (1969a), Ewens (1970), Kimura and Ohta (1971b),
and Felsenstein (1972), using various mathematical models. Kimura and
Ewens suggest that the fertility excess required in finite populations is considerably
less than that in infinite populations. Their argument is as follows:
at the steady state of gene substitution at which the introduction of new
advantageous mutations into the population and the fixation of previously
segregating alleles occur every generation at a constant rate, there are many
loci that are transiently polymorphic in the population. For example, if
the number of generations required for a gene substitution is 1000
and the number of gene substitutions per generation is 1, as was estimated
from molecular data (cf. Kimura, 1973), then there will be 1000 loci at
which gene substitution is proceeding. If there are two alleles at each locus,
the possible number of genotypes for these 1000 loci is $2^{1000} \approx 10^{301}$. This
number is so enormous that only a small proportion of the possible genotypes
will actually appear in the population. Particularly, those genotypes
which have a large number of advantageous (or disadvantageous) genes
would never appear in practice. In other words, the largest number of
advantageous alleles that can be possessed by an individual in a finite
population must be much smaller than the maximum possible number. The
fertility excess required would then be much lower than that in infinite
populations, if population size is controlled by outside factors and selection
is competitive. For example, Kimura and Ohta (1971b) show that if population
size is $10^5$, selection coefficients ($s$) are 0.01, and the number of gene
substitutions per generation is 1, the individual carrying the largest number
of advantageous alleles must have about 1.58 times as many offspring as
the average individual in a haploid population. The equivalent value for a
diploid population is 1.92. This requirement is much smaller than the
fertility excess required in infinite populations.
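
The size of the genotype space quoted above is easy to confirm with exact integer arithmetic. The following Python lines are a small check of our own, with a hypothetical helper name:

```python
import math


def log10_genotype_count(num_loci, alleles_per_locus=2):
    """log10 of the number of haploid genotypes, alleles ** loci."""
    return num_loci * math.log10(alleles_per_locus)


print(log10_genotype_count(1000))  # just above 301
print(len(str(2 ** 1000)))         # number of decimal digits: 302
```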

However,  there  seems  to  be  a problem  in  the  computation  by  Kimura, 
Ohta,  and  Ewens.  They  compute  the  mean  fitness  of  the  most  fit  individual 
in  a finite  population  after  deriving  the  variance  of  fitness  using  the  model  of 
unlimited  fertility.  If  the  model  of  limited  fertility  is  used  from  the  beginning, 
the  rate  of  change  of  gene  frequency  is  reduced  (Nei,  1973b).  Apparently,  a 
more  careful  study  should  be  made  of  the  fertility  excess  required  in  a 
finite  population.  The  actual  fertility  excess  required  seems  to  be  higher 
than  that  obtained  by  Kimura  and  Ohta. 

The theory of cost of natural selection strongly influenced Kimura (1968a)
in his development of the neutral mutation hypothesis. Using the data on
amino acid sequences of hemoglobin, cytochrome c, etc. in diverse organisms,
he computed the rate of nucleotide substitution per DNA base per year as
$10^{-10}$. Since the mammalian genome has some $3.2 \times 10^9$ base pairs, this
corresponds to a rate of gene (base) substitution equal to about 0.5 per year
per genome. He thought that this rate is so high compared with Haldane's
computation, i.e., 1/300 = 0.003 per generation, that all of the gene substitutions
cannot be due to natural selection. In order to explain the discrepancy,
Kimura suggested that a majority of gene substitutions have
occurred by random fixation of neutral or nearly neutral mutations. As will
be discussed in the next chapter, if the product of population size and
selection coefficient is much smaller than 1, the gene frequency change is
dictated by random genetic drift and no fertility excess is required.

As mentioned above, however, the fertility excess required for gene substitution
in finite populations seems to be smaller than Haldane and Kimura
originally thought, though this problem is not completely settled. Furthermore,
as will be discussed later, a large part of the DNA of higher organisms
seems to be nonfunctional. Therefore, Kimura's original argument is less
compelling at the present time. Nevertheless, his neutral mutation hypothesis
may be correct, and, in fact, there is evidence to support this hypothesis
(ch. 8).

4.6  Equilibrium  gene  frequencies 

In  the  foregoing  sections  we  were  mainly  concerned  with  directional  change 
of  gene  frequency  in  populations.  If  there  is,  however,  some  opposing  factor 
such  as  mutation  or  counteractive  selection,  gene  frequency  may  reach  a 
point  at  which  no  change  in  frequency  occurs.  Such  a point  is  called  equilibrium 
gene  frequency.  Theoretically,  there  are  many  different  ways  in  which  such 
a gene  frequency  equilibrium  may  arise.  A detailed  discussion  of  this  topic 
is  given  in  Crow  and  Kimura's  (1970)  book.  In  the  present  book  we  shall 
discuss  only  some  important  cases. 

In the classical theory of population genetics the equilibrium gene
frequency was an important subject of study. Until recently a majority of
genetic polymorphisms observed in nature were thought to be stable polymorphisms
in the sense that if gene frequency is displaced from the equilibrium
point by some factor, it is brought back to the original point sooner or later.
Particularly the stable polymorphism due to overdominant selection was
regarded as an important source of genetic variation in natural populations
(Dobzhansky, 1951). This idea is still maintained in a large school of
population geneticists (Dobzhansky, 1970). Nevertheless, there are only a
few cases in which true overdominance has been proven, and the recent
studies on protein evolution indicate that there must be a substantial amount
of transient polymorphisms in natural populations. Also, the classical theory
of gene frequency equilibrium due to the forward and backward mutations
between a pair of neutral alleles is now known to be unrealistic. At the
nucleotide or codon level new mutations are almost always different from
the preexisting alleles in the population, so that such an equilibrium would
never occur in natural populations.


4.6.1  Mutation-selection  balance  for  deleterious  genes


Although at the codon level almost any mutation is different from the alleles
extant in the population, many deleterious mutations often result in the same
or similar effect on phenotype. In this case all the deleterious genes can be
treated as a single allele and the deleterious mutation can be assumed to
occur recurrently. Since most deleterious mutations are selected against, the
gene frequency ultimately reaches an equilibrium point. Let us designate the
deleterious allele and its wild-type allele by $A_2$ and $A_1$, respectively, and let
$x_2$ be the frequency of $A_2$, so that the frequency of $A_1$ is $x_1 = 1 - x_2$. If the
fitnesses of genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$ are 1, $1 - h$, and $1 - s$,
respectively, the amount of change in $x_2$ per generation is, from (4.10),

$$\Delta x_2 = -x_1x_2[h + (s - 2h)x_2]/\bar{W}, \tag{4.50}$$

where $\bar{W} = 1 - 2hx_1x_2 - sx_2^2$. On the other hand, the amount of change in
gene frequency due to mutation is $\Delta x_2 = ux_1$, where $u$ is the mutation rate
from $A_1$ to $A_2$. Therefore, combining these two effects, we have

$$\Delta x_2 = ux_1 - x_1x_2[h + (s - 2h)x_2]/\bar{W}. \tag{4.51}$$

At equilibrium $\Delta x_2$ should be 0, so that

$$u = x_2[h + (s - 2h)x_2] \tag{4.52}$$

approximately, since $\bar{W}$ is close to 1 for a deleterious gene at equilibrium.

The equilibrium gene frequency ($\hat{x}_2$) can be obtained by solving (4.52) for
$x_2$. It becomes

$$\hat{x}_2 = \frac{-h + \sqrt{h^2 + 4u(s - 2h)}}{2(s - 2h)}. \tag{4.53}$$

In the case of completely recessive genes $h = 0$, so that

$$\hat{x}_2 = \sqrt{u/s}. \tag{4.54}$$

If $h$ is much larger than $\sqrt{su}$, the square root term in (4.53) can be written as

$$h + 2u(s - 2h)/h$$

approximately. Therefore, if the degree of dominance of the deleterious gene
is sufficiently large, we have

$$\hat{x}_2 = u/h \tag{4.55}$$

approximately.
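
The relation between the exact root (4.53) and the two approximations (4.54) and (4.55) can be checked numerically. The following is a minimal sketch, not part of the original text; the function name `equilibrium_frequency` and the parameter values are illustrative assumptions. It evaluates (4.53) in the algebraically equivalent form $2u/(h + \sqrt{h^2 + 4u(s - 2h)})$, which avoids dividing by $s - 2h$ when that quantity is zero.

```python
import math


def equilibrium_frequency(u, s, h):
    """Positive root of u = x[h + (s - 2h)x], i.e. formula (4.53).

    Written as 2u / (h + sqrt(h^2 + 4u(s - 2h))), which equals (4.53)
    after rationalising the numerator. Assumes u > 0 and s > 0.
    """
    a = s - 2.0 * h
    return 2.0 * u / (h + math.sqrt(h * h + 4.0 * u * a))


# Hypothetical values chosen only for illustration.
u, s = 1e-5, 1.0

recessive = equilibrium_frequency(u, s, h=0.0)
print("h = 0    exact:", recessive, " sqrt(u/s):", math.sqrt(u / s))

partly_dominant = equilibrium_frequency(u, s, h=0.02)
print("h = 0.02 exact:", partly_dominant, " u/h:", u / 0.02)
```

With these values $h = 0.02$ is several times larger than $\sqrt{su}$, and the exact root is already close to $u/h$.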

This  formula  can  also  be  obtained  by  noting  that  if  h is  sufficiently  large, 
selection against the deleterious gene occurs mostly in heterozygous condition
and there appear virtually no recessive homozygotes in the population.
Namely, in this case the fitnesses and frequencies of $A_1A_1$, $A_1A_2$, and $A_2A_2$
can be written approximately as follows:

Genotype     $A_1A_1$       $A_1A_2$     $A_2A_2$
Fitness      1              $1 - h$      $1 - s$
Frequency    $1 - 2x_2$     $2x_2$       -

Therefore, the amount of change in $x_2$ by selection per generation is $-hx_2$
approximately. At equilibrium this is balanced with the gain by mutation
$u(1 - x_2) \approx u$, so that $u = hx_2$. Hence, (4.55) follows.

Formulae (4.54) and (4.55) have been used by many authors, particularly
in  man  and  Drosophila.  When  these  formulae,  particularly  the  former,  are  to 
be  used,  some  caution  should  be  exercised.  First,  formula  (4.54)  is  correct 
only  in  very  large  populations.  If  population  size  is  smaller  than  the  reciprocal 
of  the  mutation  rate,  the  actual  gene  frequency  is  expected  to  be  smaller  than 
the  value  given  by  this  formula.  This  is  true  also  with  (4.55)  if  h is  close  to  0. 
We  shall  discuss  this  problem  in  ch.  5.  Second,  the  equilibrium  gene  frequency 
of  a recessive  deleterious  gene  is  affected  considerably  by  a small  positive  or 
negative  selection  in  heterozygotes.  In  most  cases  such  a small  heterozygous 
effect  on  fitness  cannot  be  determined  experimentally.  Third,  for  a recessive 
gene  it  takes  a long  time  for  the  equilibrium  to  be  attained  if  it  is  disturbed. 
Particularly  in  human  populations  the  mating  and  migration  patterns  have 
changed  considerably  in  the  last  few  centuries.  Thus,  it  is  possible  that  the 
frequencies  of  many  recessive  deleterious  genes  in  man  are  not  at  equilibrium. 
Fourth,  as  mentioned  earlier,  the  deleterious  genes  at  a locus  are  apparently 
a collection  of  different  alleles  at  the  codon  level.  Although  their  effects  on 
phenotype  are  similar,  their  effects  on  fitness  in  heterozygous  condition  may 
be different. For example, in the β-chain of human hemoglobin more than
80 different kinds of point mutations have been recorded. Many of them
affect the function of hemoglobin, but the effect is not the same for all
mutations.

Table 4.6. Estimates of gene frequencies for some genetic diseases in Caucasians.

Genetic disease                          Gene frequency    Genetic disease             Gene frequency
Dominant                                                   Recessive
  Achondroplasia                         5 x 10^-5           Albinism                  3 x 10^-3
  Retinoblastoma                         5 x 10^-?           Xeroderma pigmentosum     2 x 10^-3
  Huntington's chorea                    5 x 10^-?           Phenylketonuria           7 x 10^-3
Sex-linked                                                   Cystic fibrosis           2.5 x 10^-2
  Hemophilia                             1 x 10^-4           Tay-Sachs disease
  Muscular dystrophy (Duchenne's type)   2 x 10^-?             General                 1 x 10^-3
                                                               Ashkenazic Jews         1.3 x 10^-2

Formula (4.55) is, however, applicable for a variety of situations, if $h$ is
large. As an example, let us consider achondroplastic dwarfism in man, which
is caused by a single dominant gene. The fitness of heterozygotes for this
gene has been estimated to be $1 - h = 0.196$ (cf. Stern, 1973). In a survey
conducted in Denmark ten heterozygotes were found in a sample of 94,075
newborns. Eight out of these ten heterozygotes were fresh mutations. Thus,
the mutation rate is $8/(2 \times 94{,}075) = 4.25 \times 10^{-5}$ per generation. On the
other hand, the gene frequency ($x_2$) in newborns is estimated to be
$10/(2 \times 94{,}075) = 0.0000531$. Using this value and the estimate of fitness,
the mutation rate is computed to be $u = hx_2 = 0.0000427$ per generation. This
estimate agrees quite well with the direct estimate of mutation rate, though
the sample size is very small.
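
The arithmetic of the achondroplasia example can be reproduced in a few lines. This is only a restatement of the figures quoted above (sample size, counts, and heterozygote fitness); the variable names are assumptions introduced for illustration.

```python
# Figures quoted in the text for the Danish survey (cf. Stern, 1973).
newborns = 94_075             # sample size
affected = 10                 # heterozygotes found among the newborns
fresh_mutations = 8           # affected newborns with unaffected parents
heterozygote_fitness = 0.196  # 1 - h

genes_sampled = 2 * newborns  # each diploid newborn carries two genes

# Direct estimate: new mutant genes per gene per generation.
direct_rate = fresh_mutations / genes_sampled

# Indirect estimate from mutation-selection balance, u = h * x2 (4.55).
x2 = affected / genes_sampled
h = 1.0 - heterozygote_fitness
indirect_rate = h * x2

print(f"direct   u = {direct_rate:.3e}")
print(f"x2         = {x2:.3e}")
print(f"indirect u = {indirect_rate:.3e}")
```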

Human  populations  are  known  to  have  many  different  deleterious  genes 
whose  frequencies  are  low.  McKusick  (1971)  lists  866  distinct  clinical 
syndromes,  each  of  which  can  be  attributed  to  a single-locus  mutation.  The 
frequencies  of  some  of  these  genes  are  given  in  table  4.6.  The  reliability  of  the 
estimates  for  completely  recessive  genes  is  low  for  the  reasons  mentioned 
above.  Because  of  recent  technical  advances,  the  heterozygotes  in  some  of 
these  recessive  genes  can  now  be  detected.  Therefore,  in  the  future  more 
accurate  estimates  of  gene  frequencies  may  be  obtained. 

4.6.2  Balancing  selection 

If there are two opposing forces of selection, gene frequency equilibria may
arise. The simplest model of this is overdominant selection first proposed by
Fisher (1922). Let the fitnesses of $A_1A_1$, $A_1A_2$, and $A_2A_2$ be $1 - s_1$, 1, and
$1 - s_2$. Then, the amount of change in the frequency of $A_1$ per generation
is, from (4.15),

$$\Delta x_1 = -(s_1x_1 - s_2x_2)x_1x_2/\bar{W}, \tag{4.56}$$

where $\bar{W} = 1 - s_1x_1^2 - s_2x_2^2$. At equilibrium, $\Delta x_1 = 0$, so that the equilibrium
gene frequency is

$$\hat{x}_1 = s_2/(s_1 + s_2). \tag{4.57}$$

Using this equilibrium gene frequency, (4.56) may be written as

$$\Delta x_1 = (s_1 + s_2)(\hat{x}_1 - x_1)x_1x_2/\bar{W}. \tag{4.58}$$

Therefore, $x_1$ increases if it is smaller than $\hat{x}_1$, while it decreases if it is
larger than $\hat{x}_1$. Thus, if there is any deviation of $x_1$ from the equilibrium
gene  frequency,  the  deviation  is  reduced  every  generation,  and  the  gene 
frequency  eventually  reaches  the  equilibrium  value.  This  type  of  equilibrium 
is  called  stable  equilibrium.  Once  the  gene  frequency  reaches  the  stable 
equilibrium,  it  will  stay  there  forever  unless  the  selection  coefficients  change. 
It  is  also  noted  that,  unlike  the  case  of  mutation- selection  balance,  the 
equilibrium  gene  frequency  can  be  high  and  thus  a relatively  small  number  of 
overdominant  loci  may  create  a large  amount  of  genetic  variability. 

Overdominant selection may occur also in competitive selection. In this
case, putting $\Delta x_1 = 0$ in (4.43), we have


5 - *■ 1 ~ + 

2<Si  + - Sj) 

The  above  equilibrium  is  stable,  since 


0 < < 1 


(4.59)


- 1 < _ _ sl(l  - + 4S;S>  < 0.  (4.60) 

Formula  (4.59)  does  not  hold  when  s2  = - + s3.  In  this  case  Ax,  = 

XiX2(—  + x3Sj),  so  that 

JEi  = + s3).  (4.61) 

Therefore,  if  there  is  overdominance,  competitive  selection  also  creates  a 
stable  equilibrium  of  gene  frequency. 

Because of its simplicity, the overdominance model has been used by
many authors to explain genetic polymorphisms in natural populations. As
mentioned earlier, however, there are not many cases in which overdominance
has been proven. An oft-cited example of overdominance is the polymorphism
of chromosome inversions in Drosophila pseudoobscura. In the third
chromosome  of  this  species  there  are  many  different  gene  arrangements  in 
natural  populations.  Since  there  is  virtually  no  recombination  within  the 
inverted  segment  in  heterozygotes,  each  gene  arrangement  behaves  just  like 
a single  gene.  Wright  and  Dobzhansky  (1946)  studied  the  frequency  change 
of  gene  arrangement  Standard  (ST)  and  Chiricahua  (CH)  in  a laboratory 
population  and  showed  that  the  ST  chromosome  eventually  reaches  an 
equilibrium  frequency  of  about  70  percent.  From  the  chromosome  frequency 
changes  over  generations,  they  estimated  the  genotype  fitnesses  as  follows: 

Genotype            ST/ST       ST/CH    CH/CH
Relative fitness    1 - 0.3     1        1 - 0.7

The  expected  equilibrium  frequency  of  the  ST  chromosome  is  therefore 
0.7/(0.3  + 0.7)  = 0.7,  which  agrees  quite  well  with  the  observed  value. 
Similar  experimental  results  were  also  obtained  by  Dobzhansky  and 
Pavlovsky  (1953)  and  others. 
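
The approach to the stable equilibrium can be illustrated by iterating the one-locus selection recursion with the fitnesses estimated for ST and CH. The sketch below is an illustration only: it assumes the overdominance model with fitnesses $1 - s_1$, 1, $1 - s_2$ and an infinitely large random mating population, and the function names and starting frequencies are hypothetical.

```python
def next_frequency(x1, s1, s2):
    """One generation of selection with fitnesses 1 - s1, 1, 1 - s2.

    x1 is the frequency of the first allele (here the ST chromosome).
    Its marginal fitness is x1*(1 - s1) + x2*1 = 1 - s1*x1.
    """
    x2 = 1.0 - x1
    mean_fitness = 1.0 - s1 * x1 * x1 - s2 * x2 * x2
    return x1 * (1.0 - s1 * x1) / mean_fitness


def iterate(x1, s1, s2, generations):
    """Apply next_frequency repeatedly and return the final frequency."""
    for _ in range(generations):
        x1 = next_frequency(x1, s1, s2)
    return x1


s1, s2 = 0.3, 0.7  # selection against ST/ST and CH/CH, as quoted above

# Two hypothetical starting frequencies, one below and one above s2/(s1 + s2).
for start in (0.1, 0.95):
    print(start, "->", round(iterate(start, s1, s2, 100), 6))
```

Both trajectories move toward $s_2/(s_1 + s_2) = 0.7$, as expected for a stable equilibrium.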

However,  this  sort  of  overdominance  at  the  chromosome  level  does  not 
necessarily  mean  overdominance  at  the  gene  level,  since  the  inverted  segment 
of  a chromosome  generally  includes  a large  number  of  genes  and  the  genes 
in  this  segment  are  completely  isolated  from  those  of  other  chromosomes. 
Suppose that an inversion chromosome has genes aBc in the inverted
segment and its ancestral chromosome has AbC, where capital and small
letters denote wild-type and deleterious alleles, respectively. Then, the
inversion heterozygote aBc/AbC should have a higher fitness than the two
homozygotes aBc/aBc and AbC/AbC, if the wild-type alleles are completely
or partially dominant over deleterious genes. This apparent overdominance
is  often  called  associative  overdominance  (Frydenberg,  1963).  Associative 
overdominance  is  expected  to  occur  frequently  in  laboratory  experiments, 
since  different  gene  arrangements  used  in  these  experiments  are  often  derived 
from  a single  or  a few  individuals  in  natural  populations  (Ohta,  1971). 

If  this  is  the  case,  such  an  inversion  polymorphism  would  not  occur  in 
natural  populations,  since  the  fixation  of  a deleterious  gene  in  the  inversion 
or  standard  chromosomes  of  the  whole  population  is  almost  impossible. 
Furthermore,  for  an  inversion  polymorphism  to  be  stable  in  nature,  there 
must  be  cumulative  overdominance  (Dobzhansky's  coadaptation  of  genes) 
at more than two loci, as shown by Haldane (1957b). A single locus
overdominance is not sufficient. Interestingly, inversion polymorphisms in
natural  populations  of  Drosophila  pseudoobscura,  which  were  once  thought 
to  be  stable,  now  appear  to  be  transient,  since  the  chromosome  frequencies 
are  slowly  changing  (Dobzhansky  et  al.,  1966).  For  example,  the  frequency 
of  the  CH  chromosome  in  some  areas  of  California  declined  from  about 
50  percent  to  about  5 percent  during  the  25  years  from  1940. 

The  number  of  generations  per  year  in  this  organism  would  be  about  8. 
Thus,  the  average  change  in  chromosome  frequency  per  generation  is  roughly 
0.2  percent.  This  is  not  small  for  a gene  frequency  change.  In  some  other 
areas,  however,  the  amount  of  change  is  much  smaller  - about  10  times  lower. 
This  slow  change  of  chromosome  frequency  is,  however,  expected  to  occur  if 
the selective advantage of newly arisen inversions is conferred by a combination
of dominant favorable alleles in the inverted segment (Nei et al., 1967;
Kimura and Ohta, 1970). Many species of Hawaiian Drosophila carry various
inversion chromosomes, but even closely related species, which have diverged
probably less than 200,000 years ago, often have different inversion
polymorphisms (Carson, 1970). This fact also suggests that inversion
polymorphisms are largely transient rather than stable (see ch. 6 for further
discussion).

Even  in  noninversion  chromosomes  close  linkage  of  genes  makes  it 
difficult  to  detect  single  gene  overdominance.  Mukai  and  Burdick  (1959) 
established  a strain  of  Drosophila  melanogaster  in  which  only  a lethal  gene 
and  possibly  its  very  closely  linked  genes  are  segregating.  The  behavior  of  the 
lethal  gene  in  the  first  16  generations  in  a laboratory  population  showed  a 
perfect  pattern  of  overdominance,  the  equilibrium  gene  frequency  being 
about 0.4. Their examination of gene frequency in later generations, however,
indicated that the apparent equilibrium gene frequency was not stable,
and the gene frequency gradually declined to about 0.1 in the 71st
generation (Mukai and Burdick, 1961). Clearly, the apparent overdominance
observed in early generations was caused by a set of genes closely linked to
the lethal gene (associative overdominance) and the initial linkage
disequilibrium was gradually broken down by recombination. Similar but less
rigorous  experiments  have  been  repeatedly  reported  before  and  after  Mukai 
and  Burdick's.  The  apparent  overdominance  observed  with  some  marker 
genes in inbred strains or isogenic lines (Wills and Nichols, 1971; Sing et al.,
1973) can also be explained by associative overdominance (Yamazaki, 1972).
A similar associative overdominance may be invoked to explain the
heterozygote advantage for the black locus in the flour beetle given in fig. 4.1,
though  no  detailed  study  has  been  made. 

Nevertheless, there seem to be some cases of genuine overdominance. A
good example is the sickle cell anemia gene in African black populations.
This anemia is caused by the abnormal hemoglobin Hb S. The β-chain of the
normal  hemoglobin  A has  glutamic  acid  at  position  6.  In  hemoglobin  S this 
amino  acid  has  been  replaced  by  valine  (Ingram,  1963).  The  homozygotes  for 
the  Hb  S gene  are  almost  lethal  in  Africa  but  the  gene  frequency  is  as  high 
as  10  to  20  percent  in  some  areas.  The  prevalence  of  this  gene  is  associated 
with  a high  endemic  incidence  of  malaria.  Allison  (1955)  showed  that  the 
heterozygotes  for  the  Hb  S gene  are  more  resistant  to  malaria  than  normal 
homozygotes  and  thus  have  a higher  fitness  than  both  homozygotes.  This 
was  later  confirmed  by  studies  on  mortality  due  to  malaria  (Allison,  1964; 
Motulsky,  1964).  It  seems  that  in  malaria-endemic  areas  the  sickle  cell 
heterozygotes  have  a selective  advantage  of  about  10  to  20  percent  over 
normal  homozygotes. 

There  are  several  other  mutant  genes  which  apparently  show  heterozygote 
advantage  due  to  increased  resistance  to  malaria.  The  genes  for  hemoglobin 
variants Hb C (Glu → Lys at position 6 of the β-chain), Hb E (Glu → Lys at
position 26 of the β-chain), and thalassemia (reduced production of
hemoglobins), which also cause anemia in homozygous condition, all show a high
frequency  in  malaria-endemic  areas  (Livingstone,  1967).  Furthermore,  a 
mutant  gene  which  induces  the  deficiency  of  the  enzyme  glucose-6-phosphate 
dehydrogenase  (G6PD)  is  also  frequent  in  malarial  areas.  This  G6PD 
deficiency  gene  is  located  on  the  X chromosome.  In  this  connection  it  is 
worth  noting  that  well  before  anyone  studied  the  relationship  between  these 
genes  and  malaria,  Haldane  (1949)  had  suggested  that  the  frequency  of  the 
thalassemia  gene  is  too  high  to  be  explained  by  the  mutation-selection 
balance  and  its  polymorphism  is  probably  maintained  by  the  heterozygote 
advantage  due  to  resistance  to  malaria. 

Genuine  overdominance  need  not  be  confined  to  deleterious  genes  but  the 
overdominance  for  nondeleterious  genes  is  not  easy  to  prove.  There  is  a 
group  of  geneticists  who  believe  that  the  polymorphisms  in  the  ABO,  MN, 
and  Lewis  blood  groups  in  man  are  maintained  by  overdominance.  This  view 
is  somewhat  strengthened  if  we  note  that  the  polymorphisms  exist  not  only 
in  man  but  also  in  some  apes  (chimpanzee,  gorilla,  and  orangutan)  and 
monkeys (Wiener and Moor-Jankowski, 1971). An intensive study on the
relative  fitnesses  of  different  genotypes  in  these  blood  groups  has  been  done 
by  Morton  and  his  associates  (Morton  and  Chung,  1959;  Chung  and  Morton, 
1961;  Morton  et  al.,  1966).  Yet,  they  have  not  confirmed  any  significant 
heterozygote  advantage. 

2) Overdominance with epistasis

Overdominance  is  an  interaction  between  two  alleles  at  a locus,  while 
epistasis  is  an  interaction  between  alleles  of  two  different  loci.  Thus,  one 
might  suspect  that  epistasis  itself  is  sufficient  to  maintain  stable  polymorphism 
without overdominance. As far as constant fitness is concerned, this is
not the case. For maintaining polymorphism there must be overdominance
at least at one locus but not necessarily at both loci. Following P. M. Sheppard's
suggestion, Kimura (1956) produced a mathematical model in which, at the
first locus, alleles $A_1$ and $A_2$ are maintained by overdominance, while, at the
second locus, alleles $B_1$ and $B_2$ interact with $A_1$ and $A_2$ in such a way that
$A_1$ is advantageous in combination with $B_1$ but disadvantageous in
combination with $B_2$ and the situation is reversed for the $A_2$ allele. In this case
the B locus polymorphism may be maintained without overdominance. More
specifically, Kimura's model assumes the following genotype fitnesses.

              $A_1A_1$    $A_1A_2$    $A_2A_2$
$B_1B_1$      $1 + s$     $1 + t$     $1 - s$
$B_1B_2$      $1$         $1 + t$     $1$
$B_2B_2$      $1 - s$     $1 + t$     $1 + s$

where $0 < s < t$. Therefore, $W_i$ ($i$ = 1, 2, 3, 4) and $\bar{W}$ in (4.27) are given by

$$W_1 = 1 + X_1s + X_3t + X_4t,$$
$$W_2 = 1 - X_2s + X_3t + X_4t,$$
$$W_3 = 1 + X_1t + X_2t - X_3s,$$
$$W_4 = 1 + X_1t + X_2t + X_4s,$$
$$\bar{W} = 1 + (X_1^2 - X_2^2 - X_3^2 + X_4^2)s + 2(X_1X_3 + X_1X_4 + X_2X_3 + X_2X_4)t.$$

The equilibrium chromosome frequencies are obtained by putting $\Delta X_i = 0$ in
(4.27). They become

$$\hat{X}_1 = \hat{X}_4 = (1/2 - \beta + \sqrt{1/4 + \beta^2})/2, \tag{4.62a}$$

$$\hat{X}_2 = \hat{X}_3 = (1/2 + \beta - \sqrt{1/4 + \beta^2})/2, \tag{4.62b}$$

with

$$\hat{D} = (\sqrt{1/4 + \beta^2} - \beta)/2, \tag{4.63}$$

where $\beta = (1 + t)r/s$. It is noted that the frequencies of genes $A_1$ and $B_1$
are both 0.5. If $r = 0$, then $\beta = 0$, so that $\hat{X}_1 = \hat{X}_4 = 0.5$ and $\hat{X}_2 = \hat{X}_3 = 0$.
Namely, there are only two types of chromosomes, $A_1B_1$ and $A_2B_2$, in the
population. If $r > 0$, then all four types of chromosomes appear. Kimura
has shown that this equilibrium is stable only when $r$ is smaller than
$(t^2 - s^2)/[4t(1 + t)]$.
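
To make the dependence on the recombination value concrete, the equilibrium of Kimura's model can be evaluated numerically. The sketch below is illustrative only. It takes the equilibrium as $\hat{X}_1 = \hat{X}_4 = (1/2 - \beta + \sqrt{1/4 + \beta^2})/2$, $\hat{X}_2 = \hat{X}_3 = 1/2 - \hat{X}_1$, with $\beta = (1 + t)r/s$, and the stability bound as $(t^2 - s^2)/[4t(1 + t)]$. It also assumes that $X_1, \ldots, X_4$ denote the frequencies of chromosomes $A_1B_1$, $A_1B_2$, $A_2B_1$, $A_2B_2$. The function name and the parameter values are hypothetical.

```python
import math


def kimura_equilibrium(s, t, r):
    """Symmetric equilibrium of Kimura's (1956) two-locus model.

    Returns (X1 = X4, X2 = X3, D, r_max), where D = X1*X4 - X2*X3 is the
    linkage disequilibrium and r_max is the upper bound on r for stability.
    Assumes 0 < s < t and r >= 0.
    """
    beta = (1.0 + t) * r / s
    root = math.sqrt(0.25 + beta * beta)
    x_coupling = (0.5 - beta + root) / 2.0   # X1 = X4
    x_repulsion = (0.5 + beta - root) / 2.0  # X2 = X3
    d = (root - beta) / 2.0
    r_max = (t * t - s * s) / (4.0 * t * (1.0 + t))
    return x_coupling, x_repulsion, d, r_max


# Hypothetical selection coefficients, for illustration only.
s, t = 0.1, 0.2
for r in (0.0, 0.01, 0.03):
    x14, x23, d, r_max = kimura_equilibrium(s, t, r)
    print(f"r={r}: X1=X4={x14:.4f} X2=X3={x23:.4f} D={d:.4f} "
          f"stable={r < r_max}")
```

With $r = 0$ only the two coupling chromosomes are present and $\hat{D} = 1/4$; as $r$ increases the repulsion chromosomes appear and $\hat{D}$ declines.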

If  there  is  overdominant  selection  for  both  loci,  there  may  be  several 
stable  or  unstable  equilibria  for  a given  set  of  genotype  fitnesses.  This 
problem  has  been  studied  by  Wright  (1952),  Lewontin  and  Kojima  (1960), 
Bodmer  and  Parsons  (1962)  and  several  others.  Let  us  consider  the  following 
simple  fitness  model: 


              $A_1A_1$              $A_1A_2$    $A_2A_2$
$B_1B_1$      $(1 - s)(1 - t)$      $1 - t$     $(1 - s)(1 - t)$
$B_1B_2$      $1 - s$               $1$         $1 - s$
$B_2B_2$      $(1 - s)(1 - t)$      $1 - t$     $(1 - s)(1 - t)$


Clearly,  the  fitnesses  at  the  two  loci  are  multiplicative  and  symmetric  about 
heterozygotes;  s and  t are  the  selection  coefficients  for  either  homozygotes 
at  the  A and  B loci,  respectively.  Multiplicative  fitness  is  expected  to  occur  if 
selections  due  to  the  two  loci  are  independent.  It  involves  epistatic  interaction 
since  there  are  deviations  in  genotype  fitnesses  from  additivity  between  two 
loci.  By  using  (4.27),  it  can  be  shown  that  there  are  three  equilibria  (Bodmer 
and  Felsenstein,  1967;  Kimura  and  Ohta,  1971b).  Namely, 

$$\hat{X}_1 = \hat{X}_4 = 1/4 + (1/4)\sqrt{1 - 4r/(st)}, \tag{4.64a}$$

$$\hat{X}_1 = \hat{X}_4 = 1/4 - (1/4)\sqrt{1 - 4r/(st)}, \tag{4.64b}$$

$$\hat{X}_1 = \hat{X}_4 = 1/4, \tag{4.64c}$$

while $\hat{X}_2 = \hat{X}_3 = 1/2 - \hat{X}_1$ for each of the above equilibria. Note that the
gene frequencies of $A_1$ and $B_1$ are both 0.5 in all cases. The first two equilibria
with $\hat{D} = \pm(1/4)\sqrt{1 - 4r/(st)}$ are stable only when $r < st/4$. Otherwise,
the system will move to the third equilibrium. In practice $s$ and $t$ would rarely
exceed 0.1. If $s = t = 0.1$, $r$ must be smaller than 0.0025 for the first two
equilibria to be stable. Therefore, only when the recombination value is
extremely small, do the equilibria with linkage disequilibria become
important.
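
The condition $r < st/4$ is easy to explore numerically. The following sketch lists the equilibria of the multiplicative model for given $s$, $t$, and $r$, taking $\hat{X}_1 = \hat{X}_4 = 1/4 + \hat{D}$ and $\hat{X}_2 = \hat{X}_3 = 1/4 - \hat{D}$ with $\hat{D} = \pm(1/4)\sqrt{1 - 4r/(st)}$ or $\hat{D} = 0$. The function name and the recombination values tried are assumptions made for illustration.

```python
import math


def multiplicative_equilibria(s, t, r):
    """Equilibria of the symmetric multiplicative two-locus model.

    Returns a list of (X1 = X4, X2 = X3, D) tuples. The two equilibria with
    D != 0 exist (and are the stable ones) only when r < s*t/4; the
    equilibrium with D = 0 always exists.
    """
    d_values = [0.0]
    if r < s * t / 4.0:
        d = 0.25 * math.sqrt(1.0 - 4.0 * r / (s * t))
        d_values = [d, -d, 0.0]
    return [(0.25 + d, 0.25 - d, d) for d in d_values]


s = t = 0.1
print("threshold st/4 =", s * t / 4.0)
for r in (0.0005, 0.001, 0.01):  # hypothetical recombination values
    print("r =", r, multiplicative_equilibria(s, t, r))
```

For $s = t = 0.1$ the equilibria with linkage disequilibrium disappear once $r$ exceeds 0.0025, which is a very tight linkage.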

Karlin  and  Feldman  (1969,  1970)  (see  also  Li,  1971)  studied  a general 
symmetric fitness model with two loci each with two alleles. This model
generally permits three symmetric equilibria in the sense that $\hat{X}_1 = \hat{X}_4$ and
$\hat{X}_2 = \hat{X}_3$. In addition to these symmetric equilibria, they could show,
somewhat surprisingly, that there are several asymmetric equilibria under certain
combinations  of  genotype  fitnesses  and  recombination  value  and  the  total 
number  of  equilibria  may  be  as  large  as  seven  for  a given  fitness  set.  However, 
the  stability  of  these  asymmetric  equilibria  requires  several  severe  conditions 
about  genotype  fitness  and  recombination  value,  so  that  it  appears  to  be 
easily  upset  in  real  natural  populations,  where  environmental  conditions 
never  stay  constant  and  random  genetic  drift  due  to  finite  size  cannot  be 
neglected. 

In general, if two interacting loci are closely linked and there is
overdominance at both loci, there arise stable equilibria with $D \neq 0$. If the two
loci  are  very  tightly  linked,  they  behave  just  like  a single  locus,  forming  the 
so-called  supergene  (Ford,  1964).  On  the  other  hand,  if  the  two  loci  are 
loosely linked, there occur stable equilibria with $D = 0$. Furthermore, if a
population  is  subdivided  into  several  random  mating  units,  stable  linkage 
disequilibria  may  arise  without  any  epistatic  selection  (Li  and  Nei,  1974). 


3)  Other  types  of  balancing  selection 

Theoretically,  there  are  several  other  types  of  balancing  selection  which  may 
produce  stable  polymorphism  with  intermediate  gene  frequency.  Wright  and 
Dobzhansky  (1946)  showed  that  their  experimental  data  on  the  frequency 
changes  of  inversion  chromosomes  can  also  be  explained  by  frequency- 
dependent  selection.  Their  model  is  as  follows: 


Genotype     Frequency          Fitness
$A_1A_1$     $x_1^2$            $1 + a - bx_1$
$A_1A_2$     $2x_1(1 - x_1)$    $1$
$A_2A_2$     $(1 - x_1)^2$      $1 - a + bx_1$

Namely, the fitness of $A_1A_1$ decreases as the gene frequency ($x_1$) of $A_1$
increases, while that of $A_2A_2$ increases with increasing $x_1$. Therefore, the
gene frequency, $x_1$, reaches a stable equilibrium. The amount of change of
gene frequency per generation is given by

$$\Delta x_1 = x_1(1 - x_1)(a - bx_1)/\bar{W},$$

where $\bar{W} = 1 - (a - bx_1)(1 - 2x_1)$. Therefore,

$$\hat{x}_1 = a/b. \tag{4.65}$$

Wright and Dobzhansky's estimates of $a$ and $b$ in their case are 0.902 and
1.288, respectively, so that $\hat{x}_1 = 0.7$, as obtained earlier.
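
The frequency-dependent model can be iterated in the same way as the overdominance model. The sketch below assumes the fitnesses $1 + a - bx_1$, 1, and $1 - a + bx_1$ for $A_1A_1$, $A_1A_2$, and $A_2A_2$, for which the change per generation is $x_1(1 - x_1)(a - bx_1)/\bar{W}$ with $\bar{W} = 1 - (a - bx_1)(1 - 2x_1)$. The function names and starting frequencies are hypothetical.

```python
def next_frequency_fd(x1, a, b):
    """One generation under the frequency-dependent fitness model.

    Genotype fitnesses are 1 + a - b*x1, 1, and 1 - a + b*x1, so the
    selective difference (a - b*x1) changes sign at x1 = a/b.
    """
    advantage = a - b * x1
    mean_fitness = 1.0 - advantage * (1.0 - 2.0 * x1)
    return x1 + x1 * (1.0 - x1) * advantage / mean_fitness


def iterate_fd(x1, a, b, generations):
    """Apply next_frequency_fd repeatedly and return the final frequency."""
    for _ in range(generations):
        x1 = next_frequency_fd(x1, a, b)
    return x1


a, b = 0.902, 1.288  # Wright and Dobzhansky's estimates quoted above
print("a/b =", a / b)
for start in (0.1, 0.95):  # hypothetical starting frequencies
    print(start, "->", iterate_fd(start, a, b, 100))
```

Both starting points converge to $a/b$, so this model and the overdominance model predict the same equilibrium for these data.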

In  recent  years  many  other  models  of  frequency-dependent  selection  have 
been developed (e.g. Clarke and O'Donald, 1964; Wright, 1969). Experimental
data  which  support  the  frequency-dependent  selection  model  have  also 
increased  (ch.  6).  Yet,  the  biological  mechanism  of  frequency-dependent 
selection  is  not  well  understood.  It  is  possible  that  some  seemingly  frequency- 
dependent  selection  is  actually  caused  by  loci  closely  linked  to  a marker  gene 
or  by  subtle  environmental  changes  in  the  process  of  population  changes. 
More  studies  on  the  biological  mechanism  of  frequency-dependent  selection 
are  required. 

Levene  (1953)  showed  that  stable  polymorphism  may  occur  when  a 
population  occupies  a wide  variety  of  niches  among  which  the  selection 
coefficient for an allele varies. Several similar models are reviewed by
Maynard Smith (1970). In these models, however, rather severe conditions are
required  for  the  equilibrium  to  be  stable.  Under  certain  circumstances, 
stable  polymorphism  may  also  arise  when  selection  coefficients  vary  in 
different  generations  (Haldane  and  Jayakar,  1963b;  Hartl  and  Cook,  1973; 
Gillespie  and  Langley,  1974).  Here  again,  however,  a severe  condition  is 
required. Particularly in finite populations the 'power of holding
polymorphism' is very weak (Hedrick, 1974).


In the foregoing chapter we used a deterministic model to describe the
change of gene frequency by natural selection. This approach is equivalent
to assuming that the population size is so large that there is no sampling
error  in  the  process  of  gene  frequency  change  from  one  generation  to  the 
next.  The  number  of  breeding  individuals  in  natural  populations  is,  however, 
often  quite  small.  This  is  true  even  if  the  total  population  of  a species  is  very 
large,  since  the  distance  an  organism  migrates  in  one  generation  is  generally 
very  small  compared  with  the  total  territory  of  the  entire  population  and 
actual  breeding  occurs  among  a limited  number  of  individuals.  If  the  number 
of breeding individuals is small, the gene frequency change from one
generation to the next is subject to sampling error. Namely, gene frequency does
not  change  uniquely  from  one  value  to  the  other,  but  the  change  occurs  only 
with  a certain  probability.  This  sort  of  probabilistic  change  is  called  stochastic 
change.  In  population  genetics  this  stochastic  change  is  often  referred  to  as 
random  genetic  drift.  The  stochastic  change  of  gene  frequency  may  also  occur 
due to random fluctuation of selection intensities from generation to
generation. In general, a stochastic model is more realistic than a deterministic
one, and the latter is merely a special case of the former. Of course, the
mathematics of stochastic models is more complicated, and exact solutions are
often  difficult  to  obtain.  Nevertheless,  after  the  pioneering  work  of  Fisher 
and  Wright,  many  important  problems  have  been  solved  in  terms  of  stochastic 
models.  The  stochastic  theory  of  population  genetics  seems  to  be  particularly 
important  in  the  interpretation  of  data  on  molecular  polymorphism  and 
evolution  that  are  now  rapidly  accumulating. 

In  the  present  chapter  we  will  study  the  current  theory  of  stochastic 
changes  of  gene  frequency  which  is  relevant  to  the  study  of  molecular 
population  genetics  and  evolution. 

5.1 Stochastic change of gene frequency: discrete processes

5.1.1  Markov  chain  methods 

If  a mutation  occurs  in  a population,  the  initial  survival  of  the  mutant  gene 
depends  largely  on  chance,  whether  it  is  selectively  advantageous  or  not  or 
whether  the  population  size  is  large  or  not.  This  can  be  seen  in  the  following 
way. Let $A_1$ and $A_2$ be the mutant and its allelic gene in a population. In a
diploid organism the mutant gene appears first in heterozygous condition
($A_1A_2$). In a dioecious organism this individual will mate with a wild-type
homozygote ($A_2A_2$). The mating $A_1A_2 \times A_2A_2$, however, may not produce
any offspring for some biological reason other than the effect of the $A_1$ gene.
For example, the mate $A_2A_2$ may be sterile by chance. (In man, 5 to 10
percent of marriages are infertile.) Then, the mutant gene will disappear in
the next generation. The survival of the mutant gene is not assured even
if $A_1A_2 \times A_2A_2$ produces some offspring. This is because in the offspring
the $A_1A_2$ genotype will appear only with a probability of 1/2. Thus, if two
offspring  are  born  from  this  mating,  the  chance  that  no  A1A2  will  appear  is 
0.25. 

Let  us  now  study  this  problem  in  more  detail.  Consider  a random  mating 
population of a monoecious diploid organism. We assume that each
individual produces a large number of offspring and that exactly $N$ of these
survive to maturity. Let $x$ be the frequency of mutant gene $A_1$ among
gametes produced in a generation. The expected frequencies of genotypes
$A_1A_1$, $A_1A_2$, and $A_2A_2$ after fertilization are then given by $x^2$, $2x(1 - x)$,
and $(1 - x)^2$, respectively. We now consider selection with constant fitness,
and let the fitnesses of $A_1A_1$, $A_1A_2$, and $A_2A_2$ be $1 + s$, $1 + h$, and 1,
respectively. After selection, therefore, the gene frequency of $A_1$ changes
from $x$ to

$$\xi = \frac{x\{1 + sx + h(1 - x)\}}{1 + 2hx(1 - x) + sx^2}. \tag{5.1}$$

The number of individuals which survive to maturity is $N$ by definition. We
assume that the $2N$ genes carried by these $N$ individuals are a random sample from
the gene pool after selection, neglecting the fact that the actual survivors
are genotypes rather than genes. It is known that this assumption does not
affect the result appreciably unless the population size is extremely small.
Since the frequency of $A_1$ among the gene pool after selection is $\xi$ and $2N$
genes are chosen at random from the gene pool, the number of $A_1$ genes
among the adults may vary from 0 to $2N$. The probability that the number
of $A_1$ genes becomes $j$ is given by the $j$-th term of the binomial expansion
of $[\xi + (1 - \xi)]^{2N}$. That is,

$$P(j) = \binom{2N}{j}\xi^j(1 - \xi)^{2N - j}. \tag{5.2}$$

In this case the gene frequency is of course given by $x' = j/2N$, and the mean
($M(x')$) and variance ($V(x')$) of $x'$ are

$$M(x') = \xi, \qquad V(x') = \xi(1 - \xi)/2N. \tag{5.3}$$

It is clear that the mean gene frequency is the same as $x$ if there is no selection,
since $\xi = x$ in this case.

If $x' = 0$, there are no longer $A_1$ genes in the population, and in the
subsequent generations no change of gene frequency occurs. On the other
hand, if $x' = 1$, $A_1$ genes are fixed in the population, and again no change
in gene frequency occurs in the subsequent generations. However, if $0 <
x' < 1$, again selection and random sampling of genes occur in the next
generation. This process continues until the $A_1$ gene is lost or fixed in the
population.
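
The process just described, selection followed by binomial sampling of $2N$ genes and repeated until loss or fixation, can be simulated directly. The following is a minimal sketch under the stated assumptions (fitnesses $1 + s$, $1 + h$, 1 and random sampling of genes); the function names, the population size, and the use of Python's `random` module are choices made here for illustration only.

```python
import random


def selected_frequency(x, s, h):
    """Gene frequency after selection, formula (5.1)."""
    return x * (1 + s * x + h * (1 - x)) / (1 + 2 * h * x * (1 - x) + s * x * x)


def simulate_until_absorbed(n_individuals, initial_copies, s=0.0, h=0.0, rng=None):
    """Follow the number of A1 genes until it reaches 0 or 2N.

    Each generation draws 2N genes independently, each being A1 with
    probability xi, which is a binomial sample as in (5.2).
    Returns (final number of A1 genes, generations elapsed).
    """
    rng = rng or random.Random()
    total = 2 * n_individuals
    copies = initial_copies
    generations = 0
    while 0 < copies < total:
        xi = selected_frequency(copies / total, s, h)
        copies = sum(1 for _ in range(total) if rng.random() < xi)
        generations += 1
    return copies, generations


# A single new neutral mutant in a hypothetical population of N = 10.
rng = random.Random(1)
runs = [simulate_until_absorbed(10, 1, rng=rng)[0] for _ in range(2000)]
print("fraction of runs ending in fixation:", sum(c == 20 for c in runs) / len(runs))
```

Most runs end with the loss of the mutant within a few generations, which illustrates how strongly the initial survival of a mutant gene depends on chance.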

Mathematically, this process is called a Markov chain. If there are N
individuals in a population, there are 2N + 1 possible gene frequency classes,
i.e. 0, 1/2N, 2/2N, ..., 2N/2N. These classes are called states in probability
theory. We call the gene frequency class i/2N state i and denote by $f_t(x)$ the
probability that the gene frequency is at state i at the t-th generation, where
x = i/2N. We have already seen that when the gene frequency at a generation
is x, the probability that the gene frequency becomes x' in the next generation
is given by (5.2). Namely, this is the probability that the number of $A_1$ genes
in the population changes from i = 2Nx to j = 2Nx'. This is called the
transition probability from state i to state j, and we now denote this by $p_{i,j}$.
Then, if $f_t(x)$ is given, we can easily obtain $f_{t+1}(x)$ by the following formulae.


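\[
\begin{aligned}
f_{t+1}(0) &= p_{0,0}\,f_t(0) + p_{1,0}\,f_t\!\left(\tfrac{1}{2N}\right) + \cdots + p_{2N,0}\,f_t(1),\\
f_{t+1}\!\left(\tfrac{1}{2N}\right) &= p_{0,1}\,f_t(0) + p_{1,1}\,f_t\!\left(\tfrac{1}{2N}\right) + \cdots + p_{2N,1}\,f_t(1),\\
&\;\;\vdots\\
f_{t+1}(1) &= p_{0,2N}\,f_t(0) + p_{1,2N}\,f_t\!\left(\tfrac{1}{2N}\right) + \cdots + p_{2N,2N}\,f_t(1).
\end{aligned}
\tag{5.4}
\]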
If we use matrix notation, the above simultaneous equations may be
expressed in a simpler form. Let $\mathbf{f}_t$ be the column vector of state probabilities
$f_t(0), f_t(1/2N), \ldots, f_t(1)$, and $\mathbf{P}$ be the following matrix

\[
\mathbf{P} =
\begin{bmatrix}
p_{0,0} & p_{1,0} & \cdots & p_{2N,0}\\
p_{0,1} & p_{1,1} & \cdots & p_{2N,1}\\
\vdots & \vdots & & \vdots\\
p_{0,2N} & p_{1,2N} & \cdots & p_{2N,2N}
\end{bmatrix}.
\]

Ordinarily, the matrix of transition probabilities is defined as $\mathbf{P} = \{p_{i,j}\}$,
but the above transposed form of definition, i.e., $\mathbf{P} = \{p_{i,j}\}'$, is algebraically
a little more convenient in the present case. At any rate, the equation (5.4)
may then be written as

\[ \mathbf{f}_{t+1} = \mathbf{P}\,\mathbf{f}_t. \tag{5.5} \]

Therefore, the probability distribution of gene frequencies at the t-th
generation is given by

\[ \mathbf{f}_t = \mathbf{P}^{t}\,\mathbf{f}_0, \tag{5.6} \]

where $\mathbf{f}_0$ is the initial probability distribution. Matrix algebra indicates that
if $\mathbf{P}$ is written as $\mathbf{Q}\Lambda\mathbf{Q}^{-1}$, where $\Lambda$ is the diagonal matrix of eigenvalues and
$\mathbf{Q}$ is the matrix of the corresponding eigenvectors, then $\mathbf{P}^{t} = \mathbf{Q}\Lambda^{t}\mathbf{Q}^{-1}$.
Thus, the general solution for $\mathbf{f}_t$ may be obtained. Unfortunately, however,
it seems to be very difficult to get an explicit expression for this decomposition
in the present case, though the eigenvalues for the case of neutral genes have
been worked out (Feller, 1951).

For a small population, however, it is possible to get $\mathbf{f}_t$ by using a
high-speed computer. In this case either (5.4) or (5.6) may be used. One such
example is given in fig. 5.1, where N = 10 and no selection (h = 0 and
s = 0) are assumed. The initial gene frequency was 0.5, so that $f_0(x) = 1$
for x = 0.5 but $f_0(x) = 0$ for all other states.
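The recursion (5.4) is short enough to write out directly. The following is a minimal sketch for the neutral case with N = 10 and initial frequency 0.5; the function name `drift_step` and the variable names are ours, introduced only for illustration, and the sketch is not a reproduction of the original computation.

```python
from math import comb

def drift_step(f, two_n):
    """Apply f_{t+1} = P f_t once for a neutral locus (xi = x)."""
    new = [0.0] * (two_n + 1)
    for i, f_i in enumerate(f):
        if f_i == 0.0:
            continue
        x = i / two_n
        for j in range(two_n + 1):
            # transition probability p_{i,j}: binomial sampling of 2N genes
            p_ij = comb(two_n, j) * x**j * (1 - x) ** (two_n - j)
            new[j] += p_ij * f_i
    return new

two_n = 20                      # N = 10 diploid individuals
f = [0.0] * (two_n + 1)
f[two_n // 2] = 1.0             # all probability at gene frequency 0.5

for t in range(20):             # advance to the 20th generation
    f = drift_step(f, two_n)

fixed_or_lost = f[0] + f[two_n]
mean_freq = sum(i / two_n * f_i for i, f_i in enumerate(f))
print(round(fixed_or_lost, 3), round(mean_freq, 3))
```

The list `f` plays the role of the state vector, so the same loop can be continued for as many generations as desired.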

In the first generation the gene frequency is distributed as a binomial variate
with mean 0.5 and variance $(0.5)^2/20 = 0.0125$. In the subsequent generations
the distribution becomes flatter and flatter, and by the 20th generation
it becomes virtually uniform except for the terminal (x = 0 and x = 1)
and a few subterminal classes. By this time gene $A_1$ is lost from or fixed in
the population with probability about 0.5. After this generation, the shape
of the probability distribution of gene frequency among unfixed classes
remains virtually the same, though the absolute probability of each gene
frequency class is reduced at a rate of 1/(2N) = 0.05 in every generation.
The probabilities of classes x = 0 and x = 1 gradually increase and eventually
become 0.5 each when gene $A_1$ is completely lost or fixed. In the present case
there is no selection, so that the mean gene frequency is 0.5 throughout the
process of gene frequency changes.

Fig. 5.1. Probability distributions of gene frequencies under random mating in a finite
population. Population size is 10 and the initial gene frequency is 0.5. No selection is
assumed.

Table 5.1

Probabilities of fixation and loss of a mutant gene ($A_1$) in a population of size N = 10.
The fitnesses of $A_1A_1$, $A_1A_2$, and $A_2A_2$ are assumed to be 1, 0.9, and 0.8. The initial gene
frequency is assumed to be 1/(2N) = 0.05.

Generation   1            2            3            10           50           ∞
Fixation     [illegible]  [illegible]  [illegible]  [illegible]  0.1540       0.1755
Loss         [illegible]  0.4694       [illegible]  [illegible]  [illegible]  0.8245

In the study of evolution it is important to know the probability of
fixation of an advantageous mutant gene. This can also be studied by using
(5.6). An example is given in table 5.1, where the fitnesses of $A_1A_1$, $A_1A_2$,

and $A_2A_2$ are assumed to be 1, 0.9, and 0.8. The population size is again 10
but the initial frequency is 1/(2N) = 0.05. It is seen that the probability
of fixation is very low in early generations but gradually increases to reach
0.1755 eventually. If there were no selection, the gene would have been
fixed with probability 1/(2N) = 0.05. So, selection has increased the
probability of fixation by 0.1255, but the gene has still been lost from the
population with probability 0.8245.
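The same iteration can be run with selection by replacing x with the post-selection frequency before the binomial sampling step. The sketch below assumes the fitnesses 1, 0.9, and 0.8 and a single initial copy in a population of N = 10; the helper names (`selected_freq`, `build_columns`, `iterate`) are hypothetical, and the printed values have not been checked digit by digit against table 5.1.

```python
from math import comb

def selected_freq(x, w11, w12, w22):
    """Frequency of A1 in the gene pool after selection."""
    mean_w = x * x * w11 + 2 * x * (1 - x) * w12 + (1 - x) ** 2 * w22
    return (x * x * w11 + x * (1 - x) * w12) / mean_w

def build_columns(two_n, w11, w12, w22):
    """cols[i][j] = transition probability from i to j copies of A1."""
    cols = []
    for i in range(two_n + 1):
        xi = selected_freq(i / two_n, w11, w12, w22)
        cols.append([comb(two_n, j) * xi**j * (1 - xi) ** (two_n - j)
                     for j in range(two_n + 1)])
    return cols

def iterate(f, cols):
    """One generation of f_{t+1} = P f_t."""
    new = [0.0] * len(f)
    for i, f_i in enumerate(f):
        if f_i:
            for j, p_ij in enumerate(cols[i]):
                new[j] += p_ij * f_i
    return new

two_n = 20
cols = build_columns(two_n, 1.0, 0.9, 0.8)
f = [0.0] * (two_n + 1)
f[1] = 1.0                      # one mutant gene, frequency 1/(2N)

for t in range(2000):           # long enough for unfixed classes to vanish
    f = iterate(f, cols)

p_loss, p_fix = f[0], f[two_n]
print(round(p_fix, 4), round(p_loss, 4))
```

Stopping the loop at generation 1, 2, 3, 10, or 50 and reading off `f[0]` and `f[two_n]` gives the intermediate columns of a table of this kind.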

So far we have considered the stochastic change of gene frequencies due
to finite population size. As mentioned earlier, however, the stochastic
change may also occur by random fluctuation of selection intensities in
different generations. This problem has been studied by Wright (1948a),
Kimura (1954, 1962), Ohta (1972a), Jensen and Pollak (1969), Gillespie
(1973) and others. The effect of this factor is to spread the gene frequency
distribution, similar to that of finite population size. With certain
mathematical models, however, an effect that retards the fixation of genes may be
generated, though the biological validity of such models is disputable.

5.1.2  Variance  of  gene  frequencies  and  heterozygosity 

We have seen that one of the properties of random genetic drift is to spread
the gene frequency distribution as generations proceed. In the absence of
selection this property can be studied by a simple parameter, the variance of
gene frequencies. To make our model concrete, consider a large number of
populations of equal size N, in each of which random mating occurs. We
assume that the initial gene frequency, p, is the same for all populations.
If there is no selection, the probability that the gene frequency of $A_1$ in
the first generation becomes x = i/(2N) is

\[ f(x) = \binom{2N}{i}\,p^{i}(1-p)^{2N-i} \]

from (5.2). This probability is equal to the relative frequency of populations
that have gene frequency x. Therefore, the mean and variance of x among all
the populations are

\[ \bar{x} = E(x) = p, \tag{5.7a} \]

\[ V_x = E(x-p)^2 = \frac{p(1-p)}{2N}, \tag{5.7b} \]

respectively. In the next generation the same random process operates for
each gene frequency class x in the first generation. Therefore, letting x' be
the gene frequency in the second generation, we have

\[ \bar{x}' = E(x') = E_1[E_2(x')] = E_1(x) = p, \]

where $E_1$ and $E_2$ denote expected value operators in the first and second
generations, respectively. Clearly, the mean of x' is the same as that of x.
The variance of x' is computed in the following way.

\[
\begin{aligned}
V_{x'} &= E(x'-p)^2\\
&= E_1E_2\{(x'-x) + (x-p)\}^2\\
&= E_1E_2\{(x'-x)^2 + 2(x'-x)(x-p) + (x-p)^2\}\\
&= \frac{E_1\{x(1-x)\}}{2N} + V_x,
\end{aligned}
\]

since $E_2(x'-x)^2 = x(1-x)/(2N)$ and $E_2(x'-x) = 0$. Noting that
$E_1\{x(1-x)\} = E_1(x) - E_1(x-p)^2 - p^2 = p(1-p)\{1 - 1/(2N)\}$, we have

\[ V_{x'} = \frac{p(1-p)}{2N}\left\{1 + \left(1 - \frac{1}{2N}\right)\right\}. \]

It is now obvious that if the same process continues for t generations, the
mean ($\bar{x}_t$) and variance ($V_t$) of the gene frequency in the t-th generation are
given by

\[ \bar{x}_t = p, \tag{5.8a} \]

\[ V_t = \frac{p(1-p)}{2N}\left\{1 + \left(1-\frac{1}{2N}\right) + \cdots + \left(1-\frac{1}{2N}\right)^{t-1}\right\}
      = p(1-p)\left\{1 - \left(1-\frac{1}{2N}\right)^{t}\right\}. \tag{5.8b} \]


Therefore, the mean gene frequency remains constant for all generations,
while the variance gradually increases as t increases. At $t = \infty$ the variance
becomes p(1 - p). This corresponds to the case of complete fixation of
alleles. Since we have assumed no selection and no mutation in the present
case, alleles $A_1$ and $A_2$ are eventually fixed in the population with
probabilities p and 1 - p, respectively. The variance of gene frequency after
fixation of these alleles is, therefore, $p \cdot 1^2 + (1-p) \cdot 0^2 - p^2 = p(1-p)$.

Wright (1951, 1965) has called the ratio ($F_{ST}$) of $V_t$ to p(1 - p) the
fixation index. Clearly,

\[ F_{ST} = \frac{V_t}{p(1-p)} = 1 - \left(1-\frac{1}{2N}\right)^{t} \approx 1 - e^{-t/(2N)} \tag{5.9} \]

when N is large. Therefore, the fixation index is independent of the initial
gene frequency and increases from 0 to 1 as t increases.

We have seen that genetic drift gradually increases the interpopulational
variation of gene frequency. However, the genetic variability within
populations gradually declines. This can be studied by considering the average
frequency of heterozygotes within populations ($H_t$). The frequency of
heterozygotes in a population having gene frequency $x_t$ in the t-th generation
is given by $2x_t(1 - x_t)$. Taking the average of $2x_t(1 - x_t)$ over all
populations, we have

\[
\begin{aligned}
H_t &= 2E\{x_t(1-x_t)\} = 2E\{x_t - (x_t^2 - p^2) - p^2\}\\
&= 2(p - V_t - p^2)\\
&= 2p(1-p)\left(1-\frac{1}{2N}\right)^{t}.
\end{aligned}
\tag{5.10}
\]
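The model of many replicate populations lends itself to a direct Monte Carlo check of (5.8b) and (5.10). The sketch below is illustrative only: the function `simulate_drift`, the seed, and the number of replicate populations are arbitrary choices of ours, and agreement with the formulae is expected only up to sampling error.

```python
import random

def simulate_drift(p, n, generations, n_pops, seed=1):
    """Gene frequencies of n_pops neutral populations of size n after drift."""
    rng = random.Random(seed)
    two_n = 2 * n
    freqs = [p] * n_pops
    for _ in range(generations):
        # each population draws 2N genes at random from its own gene pool
        freqs = [sum(rng.random() < x for _ in range(two_n)) / two_n
                 for x in freqs]
    return freqs

p, n, t, n_pops = 0.5, 10, 10, 4000
freqs = simulate_drift(p, n, t, n_pops)

mean_x = sum(freqs) / n_pops
var_x = sum((x - p) ** 2 for x in freqs) / n_pops
het = sum(2 * x * (1 - x) for x in freqs) / n_pops

decay = (1 - 1 / (2 * n)) ** t
expected_var = p * (1 - p) * (1 - decay)      # formula (5.8b)
expected_het = 2 * p * (1 - p) * decay        # formula (5.10)
print(round(var_x, 3), round(expected_var, 3))
print(round(het, 3), round(expected_het, 3))
```

The ratio `var_x / (p * (1 - p))` is the simulated counterpart of the fixation index in (5.9).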
So far we have considered a single locus in a large group of populations of
equal size. The above theory, however, can also be applied to a large number
of independent neutral loci in a single population, if the initial gene frequency
is the same for all loci. In this case $H_t$ stands for the average frequency of
heterozygotes per locus in the population or the average frequency of
heterozygous loci for an individual. This quantity is generally called average
heterozygosity. In practice, of course, the assumption of an equal initial
gene frequency is unrealistic except in artificial populations. However, if we
replace $2p(1-p)$ by the average heterozygosity over all loci at the 0-th
generation, i.e. by $2\overline{p(1-p)}$, then formula (5.10) holds.

Formula (5.10) was derived for the case of two alleles at a locus, but it
holds true for any number of alleles. Suppose that there are n alleles at a
locus, and let $x_i$ be the frequency of the i-th allele in generation t. The
heterozygosity is therefore given by $H_t = 2\sum_{i<j} x_i x_j$. The next generation
is formed by sampling 2N genes at random from this population, so that the
gene frequencies ($x_i'$) in generation t + 1 follow a multinomial distribution.
Thus, the expected heterozygosity in generation t + 1 is

\[ H_{t+1} = 2E\Big(\sum_{i<j} x_i' x_j'\Big) = 2\sum_{i<j} x_i x_j\left(1-\frac{1}{2N}\right) = \left(1-\frac{1}{2N}\right)H_t, \tag{5.11} \]

since $E(x_i' x_j') = x_i x_j - x_i x_j/(2N)$ (e.g. Rao, 1952). Therefore, if we denote
by $H_0$ the heterozygosity in generation 0, we have

\[ H_t = \left(1-\frac{1}{2N}\right)^{t} H_0. \tag{5.12} \]

This indicates that the average heterozygosity per locus, which is an important
measure of genetic variability of a population, will decline at the rate of
1/(2N) per generation, if there is no mutation or selection.

Formula (5.11) can be used to derive the recurrence formula for
homozygosity ($J = \sum x_i^2$) between two generations. Since H = 1 - J, we have

\[ J_{t+1} = \frac{1}{2N} + \left(1-\frac{1}{2N}\right)J_t. \tag{5.13} \]

This formula will be used in a later section. Also, from (5.12),

\[ J_t = 1 - (1 - J_0)\left(1-\frac{1}{2N}\right)^{t}. \tag{5.14} \]

It is noted that if $J_0 = 0$, $J_t$ becomes identical to $F_{ST}$. For this reason, the
two quantities are often confused. In practice, however, $J_0$ never becomes 0.
Furthermore, if we take into account mutation and migration, $J_t$ and $F_{ST}$
take different forms, as will be seen later.

5.1.3  Effective  population  size 

In  the  above  formulation  we  have  assumed  that  the  organism  in  question 
is  monoecious  and  all  individuals  in  the  population  contribute  gametes  to 
the  next  generation  with  equal  probability,  though  there  may  be  chance 
variation.  In  practice,  however,  many  organisms  have  separate  sexes,  and 
there  are  almost  always  some  deviations  from  this  idealized  reproduction 
even in a monoecious organism. These deviations introduce many
complications in mathematical formulation, but they can be avoided if we use a
hypothetical  population  size  that  would  give  the  same  effect  on  gene 
frequency  distribution  as  in  the  idealized  population.  Such  a population  size 
is  called  effective  population  size.  This  concept  is  due  to  Wright  (1931)  and 
simplifies  the  mathematical  treatment  considerably. 

Crow (1954) has distinguished between the inbreeding effective size and
the variance effective size. The former is defined as the reciprocal of the
probability that two uniting gametes come from the same parent, while the
latter is a population size that would give the same variance of gene frequency
change due to sampling error as that in an idealized population (5.7b).
Namely, the variance effective size is

\[ N_e = x(1-x)/(2V_{\delta x}), \tag{5.15} \]

where $V_{\delta x}$ is the variance of gene frequency change for a particular case.
In this book we shall be mainly concerned with the variance effective size,
but in practice there is not much difference between the two effective sizes
except in some special cases. In the following I shall list the formulae for
estimating effective size in various cases without going into detail.

1) Separate sexes (Wright, 1931). If the population consists of $N_m$ males
and $N_f$ females, the effective size ($N_e$) is given by

\[ N_e = 4N_mN_f/(N_m + N_f). \tag{5.16} \]

Unless $N_m = N_f$, this is always smaller than the actual size ($N_m + N_f$).

2) Cyclic change of population size (Wright, 1938a). If population size
changes with a relatively short period of n generations and $N_i$ is the
population size in the i-th generation in the cycle, then

\[ N_e = \tilde{N}, \tag{5.17} \]

where $\tilde{N} = n \big/ \sum_{i=1}^{n} (1/N_i)$ is the harmonic mean. Therefore, $N_e$ is closer to the
smaller sizes than to the larger sizes in the cycle.

3) Variation in progeny size (Wright, 1938a; Crow, 1954).

\[ N_e = 2N/(1 + V_k/\bar{k}), \tag{5.18} \]

where $\bar{k}$ and $V_k$ are the mean and variance of progeny number per individual.
If progeny number follows the Poisson distribution, then $V_k = \bar{k}$, and $N_e = N$.
In general, however, $V_k > \bar{k}$, so that $N_e < N$. Crow and Morton (1955)
estimate that the ratio $N_e/N$ is about 0.75 for many organisms. In human
populations in which birth control is practiced $V_k$ is often smaller than $\bar{k}$,
so that $N_e > N$ (Imaizumi et al., 1970).

4) Heritable fertility (Nei and Murata, 1966).

\[ N_e = \frac{N}{(1 + 3h^2)C^2 + 1/\bar{k}}, \tag{5.19} \]

where $h^2$ is the heritability of fertility and $C^2 = V_k/\bar{k}^2$. If $\bar{k} = 2$, $V_k = 3$,
and $h^2 = 0.3$, then $N_e = 0.52N$.

5) Overlapping generations (Nei and Imaizumi, 1966a; Felsenstein, 1971;
Crow and Kimura, 1972; Hill, 1972). If $N_a$ is the number of individuals born
per year who survive up to reproductive age and $\tau$ is the mean age of
reproduction, then

\[ N_e = \tau N_a. \tag{5.20} \]

Nei and Imaizumi estimate that in human populations the value of $N_e$
computed from the above formula is about 40 percent of the total population
including nonreproductive individuals.
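The five formulae above are simple enough to collect as small functions. The following sketch uses function and argument names of our own choosing (they do not come from the cited papers), and it encodes (5.18) and (5.19) in the forms written above, which reproduce the two checks given in the text: $N_e = N$ for Poisson progeny number, and $N_e = 0.52N$ for $\bar{k} = 2$, $V_k = 3$, $h^2 = 0.3$.

```python
def ne_separate_sexes(n_males, n_females):
    """Formula (5.16)."""
    return 4 * n_males * n_females / (n_males + n_females)

def ne_cyclic(sizes):
    """Formula (5.17): harmonic mean of the sizes in one cycle."""
    return len(sizes) / sum(1 / n for n in sizes)

def ne_progeny_variance(n, k_mean, k_var):
    """Formula (5.18)."""
    return 2 * n / (1 + k_var / k_mean)

def ne_heritable_fertility(n, k_mean, k_var, h2):
    """Formula (5.19), with C^2 = V_k / k^2."""
    c2 = k_var / k_mean**2
    return n / ((1 + 3 * h2) * c2 + 1 / k_mean)

def ne_overlapping(tau, n_a):
    """Formula (5.20)."""
    return tau * n_a

print(ne_separate_sexes(8, 8))                          # equal sexes: actual size
print(ne_separate_sexes(1, 99))                         # one male, 99 females
print(round(ne_cyclic([10, 1000]), 1))                  # dominated by the bottleneck
print(ne_progeny_variance(100, 2, 2))                   # Poisson case
print(round(ne_heritable_fertility(1, 2, 3, 0.3), 2))   # ratio Ne/N
```

The second and third calls illustrate why the effective size is usually well below the census size: a strongly unequal sex ratio or a single small generation in the cycle dominates the result.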

It  is  clear  from  the  above  discussion  that  the  effective  size  of  natural 
populations  is  generally  much  smaller  than  the  actual  size.  See  Crow  and 
Kimura  (1970)  for  the  mathematical  aspects  of  this  problem. 
5.2 Diffusion approximations

5.2.1 Basic equations in diffusion processes


Although  the  Markov  chain  method  is  useful  in  visualizing  the  process  of 
stochastic  change  of  gene  frequency  and  provides  the  exact  distribution  of 
gene  frequencies,  it  cannot  be  used  when  population  size  is  large.  Even  a 
big  computer  cannot  accommodate  the  matrix  computation  required  if  N 
is  large.  A more  powerful  method,  which  does  not  have  this  problem,  is  that 
of  diffusion  approximations.  In  fact,  it  was  this  method  that  enabled  Kimura 
(1955a,  b)  to  study  the  whole  process  of  gene  frequency  change  in  finite 
populations. 

In  diffusion  approximations  to  discrete  processes  it  is  assumed  that  gene 
frequency  changes  continuously  with  time.  That  is,  the  sample  path  (gene 
frequency  trajectory)  is  assumed  to  be  continuous.  This  assumption  is 
satisfactory  as  long  as  population  size  is  sufficiently  large,  since  in  this  case 
the  amount  of  gene  frequency  change  per  generation  is  very  small.  In  practice, 
it  has  been  shown  (Ewens,  1963a)  that  this  method  gives  satisfactory  results 
even  if  (diploid)  population  size  is  as  small  as  6. 

Let $\phi(p, x; t)$ be the probability density that the gene frequency of $A_1$
becomes x at time t (measured in generations), given that the initial gene
frequency is p. Clearly, $\phi(p, x; t)$ is equivalent to $f_t(x)$ in the foregoing
section, and $f_t(x)$ may be approximated by $\phi(p, x; t)(1/2N)$. It can then be
shown that $\phi(p, x; t)$ satisfies the following Kolmogorov forward equation.

\[ \frac{\partial\phi}{\partial t} = \frac{1}{2}\frac{\partial^2}{\partial x^2}\{V_{\delta x}\phi\} - \frac{\partial}{\partial x}\{M_{\delta x}\phi\}, \tag{5.21} \]

where $\phi = \phi(p, x; t)$, and $M_{\delta x}$ and $V_{\delta x}$ are the mean and variance of the
change in x per generation. This equation is also called the Fokker-Planck
equation. Theoretically, $\phi(p, x; t)$ can be obtained by solving (5.21).

In  population  genetics  it  is  often  important  to  know  the  equilibrium  gene 
frequency  distribution  when  the  effects  of  two  or  more  opposing  factors  are 
balanced.  For  this  purpose,  it  is  useful  to  know  the  net  probability  flux  at  x 
at  time  t.  This  flux  is  given  by 


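\[ P(x, t) = -\frac{1}{2}\frac{\partial}{\partial x}\{V_{\delta x}\phi\} + M_{\delta x}\phi. \tag{5.22} \]

We have the following relation.

\[ \frac{\partial\phi}{\partial t} = -\frac{\partial P(x, t)}{\partial x}. \tag{5.23} \]

Namely, $\partial\phi/\partial t$ represents the rate of net flow of probability across the point x.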

In equations (5.21) and (5.22) the initial gene frequency p is fixed and the
gene frequency x at time t is assumed to be a variable. In other words, we
consider the process of gene frequency change in the forward direction. On
the other hand, it is possible to reverse the time sequence and view the
process retrospectively, treating x as fixed and p as a random variable. In
population genetics we generally consider the case where the process is time
homogeneous. That is, if $x_{t_1}$ and $x_{t_2}$ are the gene frequencies at times $t_1$ and
$t_2$ ($t_1 < t_2$), respectively, then the probability distribution of $x_{t_2}$, given $x_{t_1}$,
depends only on the time difference $t_2 - t_1$. In this case $\phi(p, x; t)$ satisfies
the following Kolmogorov backward equation

\[ \frac{\partial\phi}{\partial t} = \frac{V_{\delta p}}{2}\frac{\partial^2\phi}{\partial p^2} + M_{\delta p}\frac{\partial\phi}{\partial p}. \tag{5.24} \]

This equation is useful in deriving the probability of eventual fixation of a
mutant gene, fixation time, etc., as will be seen later.

In  the  present  book  we  shall  not  discuss  the  proof  of  (5.21),  (5.22),  and 
(5.24).  The  reader  who  is  interested  in  the  derivation  may  refer  to  the  books 
by  Crow  and  Kimura  (1970)  and  Kimura  and  Ohta  (1971b). 

In order to approximate the discrete model in section 5.1.1 by the above
diffusion process, $M_{\delta x}$ and $V_{\delta x}$ must be determined. The usual method of
obtaining these quantities is that of Feller (1951). We know from (4.10) and
(5.1) that the mean change of gene frequency per generation is

\[ E(\Delta x) = \frac{x(1-x)\{sx + h(1-2x)\}}{1 + 2hx(1-x) + sx^2}. \]

Assuming that s and h are of the order of $N_e^{-1}$, this becomes

\[
\begin{aligned}
E(\Delta x) &= x(1-x)\{sx + h(1-2x)\} + O(N_e^{-2})\\
&= x(1-x)\{\alpha x + \beta(1-2x)\}/N_e + O(N_e^{-2}),
\end{aligned}
\tag{5.25}
\]

where $\alpha = N_e s$ and $\beta = N_e h$. On the other hand, the variance may be written
as

\[ V(\Delta x) = x(1-x)/(2N_e) + O(N_e^{-2}). \]
V(Ax)  = *(1  - x)f(2N.)  + 
We now measure time in units of $N_e$ generations, so that $\Delta t = 1/N_e$. We let
$N_e \to \infty$, $s \to 0$, and $h \to 0$, such that $\alpha$ and $\beta$ stay constant. Then,

\[ M'_{\delta x} = \lim_{N_e \to \infty} \frac{1}{\Delta t} E(\Delta x) = x(1-x)\{\alpha x + \beta(1-2x)\}, \]

\[ V'_{\delta x} = \lim_{N_e \to \infty} \frac{1}{\Delta t} V(\Delta x) = \frac{x(1-x)}{2}. \]

Therefore, if we return to the original time scale,

\[ M_{\delta x} = x(1-x)\{sx + h(1-2x)\}, \tag{5.26} \]

\[ V_{\delta x} = x(1-x)/(2N_e). \tag{5.27} \]

In the above derivation of $M_{\delta x}$ we have assumed that $\alpha$ and $\beta$ stay constant
as $N_e \to \infty$. This is simply a mathematical assumption, and in practice it
would not hold true in most cases. On the other hand, if we assume the
continuity of sample path (gene frequency trajectory) from the beginning,
then

\[ M_{\delta x} = \frac{x(1-x)\{sx + h(1-2x)\}}{1 + 2hx(1-x) + sx^2} \tag{5.28} \]

may be used, while $V_{\delta x}$ is approximately equal to (5.27) (Maruyama, 1974a).
Therefore, we can use either formula for $M_{\delta x}$, depending on the assumption
made. As long as the values of s and h are small, they give essentially the
same result. Numerical computations have shown that if s and h are large,
(5.28) generally gives a better approximation to the discrete process than
(5.26). In the following we use (5.26), simply because it is simpler.
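How far the two expressions for $M_{\delta x}$ differ can be seen by evaluating both at the same point. The sketch below is only an illustration; the function names and the parameter values are ours. At x = 0.5 the relative gap is about 0.5 percent for s = 0.01, h = 0.005 and 20 percent for s = 0.5, h = 0.25.

```python
def mean_change_526(x, s, h):
    """M_dx from formula (5.26)."""
    return x * (1 - x) * (s * x + h * (1 - 2 * x))

def mean_change_528(x, s, h):
    """M_dx from formula (5.28): (5.26) divided by the mean fitness."""
    return mean_change_526(x, s, h) / (1 + 2 * h * x * (1 - x) + s * x * x)

def relative_gap(x, s, h):
    """Fraction by which (5.28) falls short of (5.26)."""
    return 1 - mean_change_528(x, s, h) / mean_change_526(x, s, h)

weak = relative_gap(0.5, 0.01, 0.005)
strong = relative_gap(0.5, 0.5, 0.25)
print(round(weak, 4), round(strong, 4))
```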


5.2.2  Transient  distribution  of  gene  frequencies 


Theoretically, the gene frequency distribution $\phi(p, x; t)$ can be obtained by
solving equation (5.21), as mentioned earlier. In practice, it is not easy to
get a general solution to this equation. So far, a complete solution has been
obtained only for two cases, i.e. the cases of no selection and genic selection.
In the case of no selection and no mutation $M_{\delta x} = 0$ and $V_{\delta x} = x(1-x)/(2N_e)$.
Therefore, (5.21) becomes

\[ \frac{\partial\phi}{\partial t} = \frac{1}{4N_e}\frac{\partial^2}{\partial x^2}\{x(1-x)\phi\}. \tag{5.29} \]
The required solution to this equation with the appropriate initial condition
has been obtained by Kimura (1955a) and is given by

\[ \phi(p, x; t) = \sum_{i=1}^{\infty} p(1-p)\,i(i+1)(2i+1)\,F(1-i, i+2, 2, p)\,F(1-i, i+2, 2, x)\,e^{-i(i+1)t/(4N_e)}, \tag{5.30} \]

where $F(\cdot,\cdot,\cdot,\cdot)$ stands for the hypergeometric function so that

\[ F(1-i, i+2, 2, x) = 1 + \frac{(1-i)(i+2)}{1\cdot 2}\,x + \frac{(1-i)(2-i)(i+2)(i+3)}{1\cdot 2\cdot 2\cdot 3}\,x^2 + \cdots. \]

Fig.  5.2.  The  processes  of  the  change  in  the  probability  distribution  of  gene  frequencies, 
due  to  random  sampling  of  gametes  in  reproduction.  It  is  assumed  that  the  population 
starts  from  the  gene  frequency  0.5  in  fig.  5.2a  and  0.1  in  fig.  5.2b.  T = time  in  generation; 
N = effective  population  size;  abscissa  is  gene  frequency;  ordinate  is  probability  density. 
This  distribution  does  not  include  gene  frequency  classes  x = 0 and  x = 1.  From  Kimura 
(1955a). 
Fig. 5.3. Distributions of gene frequencies in 19 consecutive generations among 105 lines
of Drosophila, each of 16 individuals. The gene frequencies refer to two alleles
at the 'brown' locus ($bw^{75}$ and $bw$), with initial frequencies of 0.5. The height of each black
column shows the number of lines having the gene frequency shown on the scale below.
From Buri (1956).
The property of this distribution is best understood by looking at the graphs
in fig. 5.2. It is clear from this figure that for a given value of p the
distribution depends on two factors, population size and generation. If population
size is small, the distribution becomes flat rather quickly, but if it is large it
takes a long time. As generations proceed, the distribution eventually becomes
uniform and then there is no change in form, though the absolute frequency
steadily declines. The distribution at this stage is called the steady decay
distribution. For p = 0.5, the time required to reach this steady decay distribution is
about 2N generations when N is the effective population size, while for p =
0.1 it is about 4N generations. Note that the distribution (5.30) does not
include the gene frequency classes x = 0 and x = 1.
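Because the hypergeometric series in (5.30) terminates after i terms, the density can be evaluated by a truncated sum. The following is a rough sketch under our own assumptions: the function names and the truncation at 40 terms are arbitrary, and the partial sum should be trusted only when t is not small relative to $N_e$, since for very small t many terms are needed and the alternating polynomial terms lose numerical precision.

```python
from math import exp

def hyp_poly(i, z):
    """F(1 - i, i + 2, 2, z); the series stops after i terms."""
    a, b, c = 1 - i, i + 2, 2
    term, total = 1.0, 1.0
    for k in range(i - 1):
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * z
        total += term
    return total

def kimura_density(p, x, t, n_e, terms=40):
    """Partial sum of (5.30) for an unfixed class 0 < x < 1."""
    total = 0.0
    for i in range(1, terms + 1):
        decay = exp(-i * (i + 1) * t / (4 * n_e))
        total += (p * (1 - p) * i * (i + 1) * (2 * i + 1)
                  * hyp_poly(i, p) * hyp_poly(i, x) * decay)
    return total

n_e, p = 10, 0.1
late = kimura_density(p, 0.3, 10 * n_e, n_e)
leading = 6 * p * (1 - p) * exp(-10 * n_e / (2 * n_e))   # i = 1 term alone
print(late, leading)
```

At late times the i = 1 term, $6p(1-p)e^{-t/(2N_e)}$, dominates and does not depend on x, which is the flat steady decay distribution described above.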

In order to see how this theory applies to real data, let us consider an
example from Drosophila experiments. Buri (1956) studied the gene frequency
changes of two alleles ($bw^{75}$ and $bw$) at the 'brown' locus in 105 lines of
Drosophila melanogaster, each line consisting of 8 males and 8 females. The
initial gene frequency of $bw^{75}$ was 0.5 in all lines. The results obtained are
given in fig. 5.3, where the frequencies of the fixed classes (x = 0 and x = 1)
include only those cases in which the allele $bw^{75}$ was newly fixed or lost. It
is seen that the distribution of gene frequencies gradually becomes flat as
generations proceed and after about 17 generations the distribution is
virtually uniform. Clearly, the steady decay distribution was reached much
earlier than the 2N = 32 generations expected for a population size of 16. This
difference seems to be due to the fact that the so-called effective size is much
smaller than the actual size in most cases. In fact, Buri has shown that if the
effective population size in this experiment was 11.5 (72% of the actual size),
Kimura's distribution fits the data quite well.

When  there  is  selection,  the  form  of  the  gene  frequency  distribution 
changes,  and  the  steady  decay  distribution  is  no  longer  uniform.  However, 
the  detail  of  the  gene  frequency  distribution  is  not  known  except  for  the 
case  of  genic  selection  (Kimura,  1955b). 


5.3  Gene  substitution  in  populations 

5.3.1 Probability of fixation of mutant genes

Aside from the occasional occurrence of genome or gene duplication,
evolution takes place through the process of gene substitution in populations.
We have seen that if a new advantageous mutation occurs, it may be fixed
in the population but not with probability 1. We have also seen that in a
finite population a new mutant gene may be fixed even if it has no selective
advantage. It is clearly important to determine the probability of fixation
of a mutant gene with a given selective advantage. This problem was first
studied by Fisher (1922), using the branching process method. Later, using
the same method, Haldane (1927) and Fisher (1930) derived a formula for
the probability of fixation of a mutant gene with genic selection in a large
population. The probability of fixation in a finite population was also studied
by Fisher (1930) and Wright (1931, 1942). The most general formula so far
obtained is, however, due to Kimura (1957, 1962). His method of solving
the problem is different from those of his predecessors; he used the
Kolmogorov backward equation. Let us now study this method briefly.

The general form of the Kolmogorov backward equation is given by (5.24).
In the present case we are interested in the probability of fixation of mutant
gene $A_1$, i.e. $\phi(p, 1; t)$, which we denote by u(p, t). Therefore, the Kolmogorov
backward equation becomes

\[ \frac{\partial u(p,t)}{\partial t} = M_{\delta p}\frac{\partial u(p,t)}{\partial p} + \frac{V_{\delta p}}{2}\frac{\partial^2 u(p,t)}{\partial p^2}. \tag{5.31} \]

Our problem is to determine the ultimate probability of fixation of $A_1$.
Namely,

\[ u(p) = \lim_{t \to \infty} u(p, t). \]

Since $\partial u(p, t)/\partial t = 0$ when $t \to \infty$, (5.31) reduces to

\[ \frac{V_{\delta p}}{2}\frac{d^2 u(p)}{dp^2} + M_{\delta p}\frac{du(p)}{dp} = 0. \tag{5.32} \]

This differential equation can be solved with the boundary conditions

\[ u(0) = 0, \qquad u(1) = 1. \]

The equation (5.32) may be written as

\[ \frac{d}{dp}\left\{\ln\frac{du(p)}{dp}\right\} = -\frac{2M_{\delta p}}{V_{\delta p}}. \]

Thus,

\[ \frac{du(p)}{dp} = c_1 G(p), \]

where $c_1$ is a constant. Therefore,

\[ u(p) = c_1\int_0^p G(x)\,dx + c_2, \]

where

\[ G(x) = e^{-2\int (M_{\delta x}/V_{\delta x})\,dx} \tag{5.33} \]

and $c_2$ is another constant. Since u(0) = 0, $c_2$ must be 0, while the condition
u(1) = 1 gives $c_1 = [\int_0^1 G(x)\,dx]^{-1}$. Therefore, we have the following
solution.

\[ u(p) = \int_0^p G(x)\,dx \Big/ \int_0^1 G(x)\,dx. \tag{5.34} \]

This formula was first given by Kimura (1962).

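Now, let 1 + s, 1 + h, and 1 be the fitnesses of genotypes $A_1A_1$, $A_1A_2$,
and $A_2A_2$, respectively. $M_{\delta x}$ and $V_{\delta x}$ are given by (5.26) and (5.27),
respectively. Therefore, putting these into (5.34), we obtain

\[ u(p) = \int_0^p e^{-2N_e x\{2h + (s-2h)x\}}\,dx \Big/ \int_0^1 e^{-2N_e x\{2h + (s-2h)x\}}\,dx. \tag{5.35} \]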
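Formula (5.35) has no closed form for general s and h, but the two integrals are easy to evaluate numerically. The sketch below uses a composite Simpson rule; the function names and the number of subintervals are our own choices rather than anything prescribed by the theory. With $N_e = 10$ and p = 0.05 it returns 0.05 for a neutral gene and about 0.185 for s = 0.2, h = 0.1.

```python
from math import exp

def g(x, ne, s, h):
    """Integrand G(x) of (5.35) for fitnesses 1 + s, 1 + h, 1."""
    return exp(-2 * ne * x * (2 * h + (s - 2 * h) * x))

def simpson(func, a, b, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    if b == a:
        return 0.0
    step = (b - a) / n
    total = func(a) + func(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * func(a + k * step)
    return total * step / 3

def fixation_probability(p, ne, s, h):
    """Ultimate fixation probability u(p) from (5.35)."""
    numerator = simpson(lambda x: g(x, ne, s, h), 0.0, p)
    denominator = simpson(lambda x: g(x, ne, s, h), 0.0, 1.0)
    return numerator / denominator

u_neutral = fixation_probability(0.05, 10, 0.0, 0.0)
u_additive = fixation_probability(0.05, 10, 0.2, 0.1)
u_recessive = fixation_probability(0.05, 10, 0.2, 0.0)
print(round(u_neutral, 4), round(u_additive, 4), round(u_recessive, 4))
```

Comparing `u_additive` with `u_recessive` shows how strongly the result depends on the fitness of the heterozygote when the mutant is rare.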

Let  us  now  consider  some  special  cases. 

1) Neutral genes. If the $A_1$ gene is neutral with respect to fitness (s = h =
0), then G(x) = 1. Therefore,

\[ u(p) = p. \tag{5.36} \]

Namely, the probability of fixation of a neutral mutation is equal to the
initial gene frequency, as is obvious. Thus, a nonrecurrent unique mutation
in a population of size N will be fixed with a probability of only 1/(2N).

2) Genic selection. If the selective advantage of a mutant gene is additive,
then h = s/2. In the case of genic selection, however, it is customary to denote
the fitnesses of $A_1A_1$, $A_1A_2$, and $A_2A_2$ by 1 + 2s, 1 + s, and 1 rather than
by 1 + s, 1 + s/2, and 1, respectively. Thus, we have $G(x) = \exp(-4N_e s x)$
and

\[ u(p) = (1 - e^{-4N_e s p})/(1 - e^{-4N_e s}). \tag{5.37} \]

If p = 1/(2N), this reduces to

\[ u(1/2N) = (1 - e^{-2N_e s/N})/(1 - e^{-4N_e s}). \tag{5.38} \]

Furthermore, if $N_e = N$ and s is small compared with 1, $e^{-2s}$ is 1 - 2s
approximately. So, we have

\[ u(1/2N) = 2s/(1 - e^{-4Ns}). \tag{5.39} \]

This formula is equal to that obtained by Fisher (1930) and Wright (1931).
It is also interesting to see that if $N \to \infty$, u(1/2N) is equal to 2s, which
agrees with the result obtained by the branching process method (Haldane,
1927; Fisher, 1930), where population size is assumed to be infinitely large.

On the other hand, if $4N_e s \ll 1$, then u(p) is approximately equal to p
from (5.37). Namely, in this case the mutant gene behaves just like a neutral
allele.

In  section  5.1  we  have  seen  by  the  method  of  Markov  chains  that  the 
probability  of  fixation  of  a mutant  gene  with  ^ = 0.1  in  a population  of 
N = Ne  = 10  is  0.1755.  If  we  use  (5.38),  the  probability  becomes  0.1846. 
So,  this  is  very  close  to  the  exact  probability  even  if  N is  very  small  and  s 
is  quite  large.  If  N is  large  and  s is  small,  the  agreement  between  the  values 
obtained  by  the  two  methods  is  much  better. 
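
The diffusion formulae for genic selection are easy to evaluate numerically. The following is a minimal sketch, assuming fitnesses 1 + 2s, 1 + s, and 1 as in the text; the function names are illustrative and not from the original. A negative s gives the deleterious case.

```python
import math


def fixation_prob_genic(p, N_e, s):
    """Fixation probability under genic selection, formula (5.37).

    p   : initial gene frequency
    N_e : effective population size
    s   : selection coefficient (fitnesses 1 + 2s, 1 + s, 1)
    """
    S = 4.0 * N_e * s
    if abs(S) < 1e-12:
        # Neutral limit, formula (5.36): u(p) = p
        return p
    # (1 - exp(-S p)) / (1 - exp(-S)), written with expm1 for accuracy
    return math.expm1(-S * p) / math.expm1(-S)


def fixation_prob_new_mutant(N, N_e, s):
    """Formula (5.38): a single new mutant, p = 1/(2N)."""
    return fixation_prob_genic(1.0 / (2.0 * N), N_e, s)


def fixation_prob_small_s(N, s):
    """Formula (5.39): N_e = N and s small, so that 1 - exp(-2s) ~ 2s."""
    return 2.0 * s / (1.0 - math.exp(-4.0 * N * s))


if __name__ == "__main__":
    # N = N_e = 10 and s = 0.1, the case compared with the Markov chain result
    print(fixation_prob_new_mutant(10, 10, 0.1))
    print(fixation_prob_small_s(10, 0.1))
```

With N = Ne = 10 and s = 0.1 the first call gives about 0.185, which is the diffusion value quoted above.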

If the mutant gene A1 is disadvantageous and the fitnesses of A1A1, A1A2, and A2A2 are 1 - 2s, 1 - s, and 1, respectively, then

$u(1/2N) = \dfrac{e^{2s} - 1}{e^{4Ns} - 1},$ (5.40)

where Ne = N is assumed. If s ≪ 1, u(1/2N) is approximately $2s/(e^{4Ns} - 1)$.

Therefore, if 4Ns is small, even a deleterious mutation may be fixed with an appreciable probability.

3) Dominant genes. In this case h = s. Thus, G(x) = exp{-2Nes(2x - x²)}. When 2Nes is large compared with unity, G(x) rapidly decreases as x increases from 0 to 1, so that only small values of x, for which x² is negligible compared with 2x, contribute appreciably. Therefore, it may be approximated by G(x) = exp(-4Nesx), which is the same as that for the case of semidominant genes. Namely, the probability of fixation of a dominant mutation is approximately the same as that of a semidominant mutation. This indicates that the probability of fixation of a mutant gene is largely determined by the heterozygote fitness.

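4) Recessive genes. Since h = 0 in this case, G(x) = exp(-2Nesx²). The numerator in (5.35) may be written as

$\int_0^p e^{-2N_e s x^2}\,dx = \dfrac{1}{2}\sqrt{\dfrac{\pi}{2N_e s}}\;\mathrm{erf}\!\left(\sqrt{2N_e s}\,p\right),$

where erf(x) is the error function, defined as

$\mathrm{erf}(x) = \dfrac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt = \dfrac{2x}{\sqrt{\pi}}\left(1 - \dfrac{x^2}{1!\,3} + \dfrac{x^4}{2!\,5} - \cdots\right).$

Similarly, the denominator may be expressed as

$\int_0^1 e^{-2N_e s x^2}\,dx = \dfrac{1}{2}\sqrt{\dfrac{\pi}{2N_e s}}\;\mathrm{erf}\!\left(\sqrt{2N_e s}\right).$

Therefore, we have

$u(p) = \mathrm{erf}\!\left(\sqrt{2N_e s}\,p\right)/\mathrm{erf}\!\left(\sqrt{2N_e s}\right).$ (5.41)

The values of erf(x) may be obtained from a table (e.g., Abramowitz and Stegun, 1964).

If $\sqrt{2N_e s} > 2$, $\mathrm{erf}\{\sqrt{2N_e s}\}$ is 1 approximately, and if p = 1/(2N), $\mathrm{erf}\{\sqrt{2N_e s}/(2N)\}$ is $\sqrt{2N_e s/\pi}/N$ approximately. Therefore, if Ne = N,

$u(1/2N) \approx \sqrt{2s/(\pi N)}.$ (5.42)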

This  indicates  that  in  a large  population  the  probability  of  fixation  of  a 
recessive  mutation  is  very  small.  Formula  (5.42)  is  due  to  Kimura  (1957), 
but  slightly  less  accurate  formulae  had  been  obtained  by  Haldane  (1927) 
and  Wright  (1942). 
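
Formula (5.41) and its large-population approximation (5.42) can be compared directly with the error function in the Python standard library. This is a sketch only; the function names are illustrative, and s > 0 (an advantageous recessive mutant) is assumed.

```python
import math


def fixation_prob_recessive(p, N_e, s):
    """Formula (5.41): u(p) = erf(sqrt(2 N_e s) p) / erf(sqrt(2 N_e s))."""
    a = math.sqrt(2.0 * N_e * s)
    return math.erf(a * p) / math.erf(a)


def fixation_prob_recessive_approx(N, s):
    """Formula (5.42): u(1/2N) ~ sqrt(2 s / (pi N)), assuming N_e = N
    and sqrt(2 N s) > 2."""
    return math.sqrt(2.0 * s / (math.pi * N))


if __name__ == "__main__":
    N, s = 1000.0, 0.01
    print(fixation_prob_recessive(1.0 / (2.0 * N), N, s))
    print(fixation_prob_recessive_approx(N, s))
    print(1.0 / (2.0 * N))  # neutral value, for comparison
```

For N = 1000 and s = 0.01 the two values agree closely, and both are several times larger than the neutral probability 1/(2N), although still small in absolute terms.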

5) Overdominant genes. Nei and Roychoudhury (1973a) studied the probability of fixation of a single overdominant mutation. In an infinitely large population a pair of overdominant genes create a stable polymorphism and may exist forever in the population, as we have seen in ch. 4. In finite populations, however, even an overdominant mutation will eventually be fixed or lost from the population. Let 1 - s1, 1, and 1 - s2 be the fitnesses of A1A1, A1A2, and A2A2, respectively. We have seen that in a large population the equilibrium gene frequency of A1 is given by m = s2/(s1 + s2). The probability of fixation of a single overdominant mutant gene is highly dependent on this m value and N(s1 + s2). If m < 0.5 (disadvantageous overdominant genes), the probability is generally much lower than that of neutral genes; but if m is close to 0.5 and N(s1 + s2) is relatively small, it becomes higher. If m > 0.5 (advantageous overdominant genes), the probability is largely determined by the fitness of heterozygotes rather than the fitness of mutant homozygotes. Thus, overdominance enhances the probability of fixation of advantageous mutations. Of course, if m is close to 0.5 and Ne(s1 + s2) is large, the time to fixation of an overdominant gene is very large, as will be seen later.

The theory of the probability of fixation of a mutant gene discussed in this section is dependent on the assumption of a single random mating population. Most natural populations are, however, divided into many subpopulations. Fortunately, the above theory seems to hold even in subdivided populations, at least in the cases of no selection and genic selection, if migration takes place among subpopulations (Maruyama, 1970a). In this case N stands for the total population.

5.3.2 Rate of gene substitution and average substitution time

In ch. 3 we have seen that the rate of mutation per nucleotide or codon per generation is very small. It is, therefore, quite satisfactory to assume that at the codon level a new mutation occurring in a population is always different from the preexisting alleles in the population. If the mutation rate per generation is v at a locus, then there occur 2Nv mutations at this locus in every generation, all mutant alleles being different from each other at the codon level. In the case of neutral mutations only 1/(2N) of the 2Nv mutations will be fixed (see fig. 5.4). Therefore, at the steady state where the effects of mutation and genetic drift are balanced, the rate of gene substitution per generation is

$\alpha = 2Nv \times \dfrac{1}{2N} = v.$ (5.43)

Namely, the rate of gene substitution is equal to the mutation rate per locus. This simple rule was first noted by Kimura (1968a).

Fig. 5.4. A typical pattern of extinction and multiplication of selectively neutral mutants in a finite population when they occur at the rate of one mutation every ten generations (4Nev = 0.2); the horizontal axis is time. A's represent mutations. At a particular evolutionary time a population may be monomorphic or polymorphic for two common alleles (CC), one common allele and one rare allele (CR), etc. From Kimura and Ohta (1973b).

In general, a new mutant gene is fixed with a probability of u = u(1/2N), which is given by (5.34). Therefore, the rate of gene substitution at the steady state is

$\alpha = 2Nvu.$ (5.44)

If Ne = N and new mutant genes are semidominant or completely dominant, then u ≈ 2s in large populations. Thus, the rate of substitution of such genes is

$\alpha = 4Nsv,$ (5.45)

which  depends  on  three  factors,  i.e.  population  size,  selection  coefficient,  and 
mutation  rate.  For  the  rate  of  gene  substitution  to  be  constant,  as  is  apparently 
the  case  with  some  proteins,  N,  s,  and  v must  therefore  be  adjusted  in  the 
course  of  evolution  in  such  a way  that  their  product  remains  constant  per 
year  over  diverse  evolutionary  lines  such  as  primates  and  fungi.  Kimura 
(1969b)  and  Kimura  and  Ohta  (1971a)  think  that  this  is  unlikely  and  a much 
simpler  explanation  of  constant  rate  of  gene  substitution  is  to  assume  that 
a majority  of  gene  substitutions  have  occurred  by  random  fixation  of  neutral 
or  nearly  neutral  mutations. 

Since the rate of gene substitution is 2Nvu per locus per generation, the average time for one gene substitution to occur in a population of size N is given by

$T_g = 1/(2Nvu).$ (5.46)

Namely, on the average one gene substitution is expected to occur in every Tg generations (fig. 5.4). Margoliash and Smith (1965) called Tg the unit evolutionary period. If mutant genes are selectively neutral, Tg = 1/v (Crow and Kimura, 1970).

For example, the hemoglobin β-chain gene has 146 codons. It is known that the rate of codon substitutions per locus is 10^-7 per year. Thus, the average time for one codon substitution to occur is Tg = 10^7 years.
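
The relations (5.44) to (5.46) can be sketched in a few lines. The function names below are illustrative only. The neutral case shows that the population size cancels, and the last lines reproduce the hemoglobin example, with the rate taken per year rather than per generation.

```python
def substitution_rate(N, v, u):
    """Formula (5.44): rate of gene substitution, alpha = 2 N v u.

    N : population size, v : mutation rate per locus,
    u : fixation probability of a single new mutant.
    """
    return 2.0 * N * v * u


def substitution_time(N, v, u):
    """Formula (5.46): average time per substitution, T_g = 1 / (2 N v u)."""
    return 1.0 / substitution_rate(N, v, u)


if __name__ == "__main__":
    v = 1e-7
    # Neutral mutations: u = 1/(2N), so alpha = v whatever the value of N.
    for N in (1e2, 1e4, 1e6):
        print(N, substitution_rate(N, v, 1.0 / (2.0 * N)))
    # Advantageous semidominant mutations: u ~ 2s, so alpha = 4 N s v (5.45).
    print(substitution_rate(1e4, v, 2 * 0.01))
    # Hemoglobin example: a rate of 1e-7 per year gives T_g = 1e7 years.
    print(substitution_time(1e4, v, 1.0 / (2.0 * 1e4)))
```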

In  a recent  study  of  population  dynamics  of  neutral  mutations  Guess 
and  Ewens  (1972)  claimed  that  the  parameter  Tg  is  biologically  meaningless 
unless 4Nv ≪ 1. Their conclusion is based on the model of infinite alleles
per locus, which will be discussed later. However, if gene substitutions are
counted  at  each  codon  separately  and  then  summed  over  all  codons  to  get 
the  rate  of  gene  substitution  per  cistron,  the  above  definition  of  Tg  is  quite 
meaningful. 

5.3.3 Fixation time and extinction time of mutant genes

Let  us  now  consider  how  long  it  takes  for  a mutant  gene  to  be  fixed  in  the 
population.  More  specifically,  we  trace  a particular  mutant  allele  and  study 
the  average  number  of  generations  at  which  the  frequency  of  the  allele 
becomes  1 (fig.  5.4).  Theoretically,  this  average  fixation  time  can  be  obtained 
by  integrating  the  sojourn  time  that  the  gene  frequency  spends  at  a particular 
value  x,  given  that  the  allele  is  going  to  be  fixed  (Maruyama  and  Kimura, 
1971;  Ewens,  1973).  Here,  however,  we  follow  the  method  used  by  Kimura 
and  Ohta  (1969a),  since  it  gives  a better  understanding  of  the  process. 

As in section 5.3.1, let u(p, t) be the probability that the mutant gene becomes fixed in the population by generation t, given that the initial gene frequency is p. Since the probability that the mutant gene is fixed at generation t is ∂u(p, t)/∂t, the average number of generations at which the gene is fixed is given by

$T_1(p) = \int_0^\infty t\,\dfrac{\partial u(p,t)}{\partial t}\,dt.$

We are not, however, interested in the event in which the mutant gene is lost from the population. Therefore, if the eventual probability of fixation of the A1 gene is u(p), then the average fixation time is given by

$\bar t_1(p) = T_1(p)/u(p).$ (5.47)

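We first derive the formula for T1(p) by using (5.31). Differentiating each term of (5.31) with respect to t, multiplying each resulting term by t, and integrating them with respect to t from 0 to ∞, we have

$\int_0^\infty t\,\dfrac{\partial^2 u(p,t)}{\partial t^2}\,dt = \dfrac{V_{\delta p}}{2}\,\dfrac{d^2 T_1(p)}{dp^2} + M_{\delta p}\,\dfrac{dT_1(p)}{dp}.$

The left-hand side of this equation is

$\left[t\,\dfrac{\partial u(p,t)}{\partial t}\right]_0^\infty - \int_0^\infty \dfrac{\partial u(p,t)}{\partial t}\,dt = -u(p),$

where we have assumed that t ∂u(p, t)/∂t vanishes at t = ∞. Therefore, we have the following differential equation

$T_1''(p) + a(p)\,T_1'(p) + b(p) = 0,$ (5.48)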

where $a(p) = 2M_{\delta p}/V_{\delta p}$ and $b(p) = 2u(p)/V_{\delta p}$. The boundary conditions for (5.48) are T1(0) = 0 and T1(1) = 0. Solution of (5.48) with these boundary conditions gives

$T_1(p) = u(p)\int_p^1 \psi(z)\,u(z)\{1 - u(z)\}\,dz + \{1 - u(p)\}\int_0^p \psi(z)\,u^2(z)\,dz,$ (5.49)

where u(p) is given by (5.34) and

$\psi(x) = 2\int_0^1 G(x)\,dx \Big/ \{V_{\delta x}\,G(x)\},$

in which G(x) is given by (5.33). From (5.47) and (5.49), the average fixation time is then given by

$\bar t_1(p) = \int_p^1 \psi(z)\,u(z)\{1 - u(z)\}\,dz + \dfrac{1 - u(p)}{u(p)}\int_0^p \psi(z)\,u^2(z)\,dz.$ (5.50)

The average number of generations for a mutant gene to be lost from the population can be obtained in the same way. The result is given by

$\bar t_0(p) = \dfrac{u(p)}{1 - u(p)}\int_p^1 \psi(z)\{1 - u(z)\}^2\,dz + \int_0^p \psi(z)\,u(z)\{1 - u(z)\}\,dz.$ (5.51)

The variance of fixation time or extinction time can also be studied in the same way. In this case, however, it is more convenient to use the concept of sojourn time. In practice, the variance is very large. The standard error of fixation time is generally of the same order of magnitude as the mean (Kimura and Ohta, 1969b; Narain, 1970).

Let  us  now  consider  some  special  cases  to  get  a rough  idea  about  the 
average  fixation  and  extinction  times. 

1) Neutral genes. In this case $M_{\delta x} = 0$ and $V_{\delta x} = x(1 - x)/(2N_e)$. So, G(x) = 1, $\psi(x) = 4N_e/\{x(1 - x)\}$, and u(p) = p. Hence,

$\bar t_1(p) = -\dfrac{4N_e(1 - p)}{p}\,\log_e(1 - p).$ (5.52)

If population size is large and the initial gene frequency is 1/(2N), then

$\bar t_1(1/2N) = 4N_e$ (5.53)

approximately, by taking the limit of p → 0. Therefore, it takes a long time for a mutant gene to be fixed in the population, if Ne is large. The average extinction time of a neutral mutation is much shorter than the average fixation time and given by

$\bar t_0(p) = -4N_e\left(\dfrac{p}{1 - p}\right)\log_e p,$ (5.54)

which becomes

$\bar t_0(1/2N) = 2(N_e/N)\log_e(2N)$ (5.55)

approximately, if p is 1/(2N). For example, if Ne/N = 0.8 and N = 10^4, the extinction time is about 16 generations.
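
The neutral-case results are simple enough to check directly. The sketch below evaluates the mean fixation time −4Ne{(1 − p)/p} log_e(1 − p) of (5.52) and the mean extinction time −4Ne{p/(1 − p)} log_e p of (5.54); the function names are illustrative.

```python
import math


def neutral_fixation_time(p, N_e):
    """Formula (5.52): mean fixation time of a neutral allele, in generations."""
    return -4.0 * N_e * (1.0 - p) / p * math.log1p(-p)


def neutral_extinction_time(p, N_e):
    """Formula (5.54): mean extinction time of a neutral allele, in generations."""
    return -4.0 * N_e * p / (1.0 - p) * math.log(p)


if __name__ == "__main__":
    N = 1e4
    N_e = 0.8 * N
    p = 1.0 / (2.0 * N)            # a single new mutant
    print(neutral_fixation_time(p, N_e))     # close to 4 N_e, formula (5.53)
    print(neutral_extinction_time(p, N_e))   # close to 2 (N_e/N) log_e(2N), formula (5.55)
```

For Ne/N = 0.8 and N = 10^4 the extinction time comes out at roughly 16 generations, against a fixation time of roughly 4Ne = 32,000 generations.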

2) Genic selection. If the mutant gene is selectively advantageous over the wild-type allele and the fitnesses of A1A1, A1A2, and A2A2 are given by 1 + 2s, 1 + s, and 1, respectively, then $M_{\delta x} = sx(1 - x)$. On the other hand, $V_{\delta x} = x(1 - x)/(2N_e)$ as before. Thus, putting these into (5.49), we can obtain the fixation time. However, the resulting formula is somewhat complicated (Kimura and Ohta, 1969a), and I shall not reproduce it here.
Numerical  computations,  however,  indicate  that  the  fixation  time  of  a 
semidominant  mutation  is  shorter  than  that  of  a neutral  mutation,  as  expected. 
For  example,  when  Nes  = 2.5,  the  fixation  time  is  about  half  that  of  a neutral 
mutation. 
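
Although the closed form is complicated, the integrals in (5.50) are easy to evaluate numerically. The sketch below is one way to do so under genic selection, with G(x) = exp(−4Nesx), u(z) as in (5.37), and ψ(z) = 2∫G dx/{V G}. The midpoint rule, the function name, and the parameter values are choices made for this illustration, not part of the original treatment; s must be nonzero.

```python
import math


def mean_fixation_time_genic(N_e, s, p, n=5000):
    """Midpoint-rule evaluation of formula (5.50) for genic selection.

    Fitnesses are 1 + 2s, 1 + s, 1, so G(x) = exp(-S x) with S = 4 N_e s.
    Requires s != 0.
    """
    S = 4.0 * N_e * s
    int_G = -math.expm1(-S) / S          # integral of G(x) over (0, 1)

    def u(z):
        # fixation probability, formula (5.37)
        return math.expm1(-S * z) / math.expm1(-S)

    def psi(z):
        # psi(z) = 2 * int_G / (V G), with V = z (1 - z) / (2 N_e)
        return 4.0 * N_e * int_G * math.exp(S * z) / (z * (1.0 - z))

    def integrate(f, a, b):
        h = (b - a) / n
        return h * sum(f(a + (i + 0.5) * h) for i in range(n))

    up = u(p)
    first = integrate(lambda z: psi(z) * u(z) * (1.0 - u(z)), p, 1.0)
    second = integrate(lambda z: psi(z) * u(z) ** 2, 0.0, p)
    return first + (1.0 - up) / up * second


if __name__ == "__main__":
    N_e = 1000.0
    p = 1.0 / (2.0 * N_e)
    # N_e s = 2.5: fixation time relative to the neutral value 4 N_e
    print(mean_fixation_time_genic(N_e, 0.0025, p) / (4.0 * N_e))
```

With Nes = 2.5 the ratio to 4Ne comes out between 0.5 and 0.6, in line with the statement that the fixation time is about half that of a neutral mutation.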

3) Mutant genes with overdominance and complete dominance. Let 1 - s1, 1, and 1 - s2 be the fitnesses of A1A1, A1A2, and A2A2, respectively. Then, $M_{\delta x} = (s_1 + s_2)\,x(1 - x)(m - x)$ and $V_{\delta x} = x(1 - x)/(2N_e)$, where m = s2/(s1 + s2). Using these quantities, it can be shown that when p = 1/(2N), the average fixation time is


$\bar t_1 = \dfrac{4N_e}{K}\int_0^1 \dfrac{\int_0^x e^{A(y - m)^2}\,dy\,\int_x^1 e^{A(y - m)^2}\,dy}{x(1 - x)\,e^{A(x - m)^2}}\,dx$ (5.56)

approximately, where

$A = 2N_e(s_1 + s_2)$ and $K = \int_0^1 \exp\{A(x - m)^2\}\,dx$

(Nei and Roychoudhury, 1973a).

Fig. 5.5. Mean fixation time of an overdominant mutation relative to that of a neutral mutation. From Nei and Roychoudhury (1973a).

Fig. 5.5 shows some of the numerical values
for the case of Ne = N. In this figure $\bar t_1$ is expressed relative to the fixation time of a neutral mutation, i.e. 4N. The relative fixation time depends markedly on the value of m. As expected, if m is close to 0.5, the fixation time is much longer than that for neutral genes when N(s1 + s2) is large. However, if m is outside the range of approximately 0.2 to 0.8, the fixation time of overdominant mutations is shorter than that of neutral mutations, depending on the value of N(s1 + s2). A continued increase in this quantity gradually widens the range of m for prolonged mean fixation time. It is seen that the relative fixation time is virtually symmetric around m = 0.5. Namely, a disadvantageous overdominant mutation with m < 0.5 has the same fixation time as that of an advantageous overdominant mutation with 1 - m if N(s1 + s2) is the same. The symmetry of fixation time around m = 0.5 can be seen also from expression (5.56). It is interesting to see that the dependence of $\bar t_1$ on m and N(s1 + s2) is similar to that of the rate of decay of genetic variability at steady state studied by Robertson (1962) and Miller (1962), though the reason is not the same.
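
The dependence on m can be explored numerically. The sketch below evaluates the p → 0 limit of (5.50) with G(x) = exp{A(x − m)²}, where A = 2Ne(s1 + s2), and returns the mean fixation time relative to the neutral value 4Ne. The discretisation and the function name are assumptions made for this illustration, and the code is not a reproduction of the computations behind fig. 5.5.

```python
import math


def relative_fixation_time(m, A, n=4000):
    """Mean fixation time of a single new mutant, relative to 4 N_e.

    Uses the p -> 0 limit of formula (5.50):
        t / (4 N_e) = (1/K) * int_0^1 U(x) (K - U(x)) / (x (1 - x) G(x)) dx,
    where G(x) = exp(A (x - m)^2), U(x) = int_0^x G, and K = U(1).
    """
    h = 1.0 / n
    mids = [(i + 0.5) * h for i in range(n)]
    G = [math.exp(A * (x - m) ** 2) for x in mids]
    K = h * sum(G)                      # midpoint estimate of int_0^1 G(x) dx
    total = 0.0
    below = 0.0                         # integral of G up to the left cell edge
    for x, g in zip(mids, G):
        U = below + 0.5 * h * g         # integral of G up to the cell midpoint
        total += U * (K - U) / (x * (1.0 - x) * g)
        below += h * g
    return h * total / K


if __name__ == "__main__":
    for m in (0.3, 0.5, 0.7, 1.0):
        print(m, relative_fixation_time(m, 8.0))
```

With A = 8 the values for m = 0.3 and m = 0.7 coincide, the value for m = 0.5 exceeds 1, and the value for m = 1 (complete dominance) is below 1; setting A = 0 recovers the neutral value of 1.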

We note that s1 = 0 represents the case of completely dominant genes. In this case m = 1, so that the fixation time of a completely dominant gene is generally much shorter than that for a neutral gene, as expected. Interestingly, however, a completely recessive mutation with a selective disadvantage of s (m = 0) has the same fixation time as that of a completely dominant mutation with a selective advantage of s if population size is the same. This paradox is resolved if we note that the probability of fixation of a recessive disadvantageous gene is very low and if it is fixed its frequency should be increased rapidly by genetic drift.

4) Deleterious mutations. Let 1 - s, 1 - h, and 1 be the fitnesses of A1A1, A1A2, and A2A2, respectively. If h > 0.03, s > 0.5, and 4Neh ≫ 1, then there arise virtually no homozygotes in the population and selection against the mutant gene occurs mostly in the heterozygous state. In this case it can be shown that

$\bar t_0 = 2(N_e/N)\left[\log_e\{N/(2N_e h)\} + 0.433\right]$ (5.57)

(Kimura and Ohta, 1969b; Li and Nei, 1972). Thus, $\bar t_0$ is independent of population size if Ne/N remains constant.

Since  the  extinction  time  of  a deleterious  mutation  is  important  from  the 
standpoint  of  public  health,  this  problem  has  been  studied  extensively  by 
Nei  (1971c)  and  Li  and  Nei  (1972).  The  extinction  time  is  highly  dependent 
on  the  heterozygous  effect  of  a mutant  gene  and  population  size.  It  has  been 
shown that if h > 0.02 and s > 0.5, the extinction time is only a few generations and almost independent of population size. If the mutant gene shows a slight overdominance, the extinction time increases rapidly with increasing population size. For example, if h = -0.02 and s = 1, the extinction time is 13 generations for Ne = 1000, but 2090 generations for Ne = 10,000.

Another  important  problem  in  relation  to  public  health  is  the  total  number 
of  heterozygous  or  homozygous  individuals  affected  by  a single  deleterious 
mutation. This problem has been studied by Nei (1971d) and Li and Nei (1972).

5.3.4 First arrival time and age of a mutant gene

Natural  populations  contain  a large  number  of  polymorphic  genes.  It  is 
interesting  to  know  how  long  a particular  polymorphic  allele  has  existed 
in  the  population  after  it  arose  by  mutation.  This  problem  can  be  studied 
in  two  different  ways.  One  is  to  ask  the  average  number  of  generations 
required  for  a mutant  allele  to  reach  the  present  frequency  on  the  assumption 
that  this  frequency  was  reached  for  the  first  time.  This  is  called  the  average 
first  arrival  time.  The  other  is  to  determine  the  same  average  number  of 
generations,  taking  into  account  the  possibility  that  the  gene  frequency  has 
been  higher  than  the  present  one.  This  is  called  the  average  age. 

The average first arrival time from gene frequency p to x can be obtained by terminating the process of gene frequency change as soon as it reaches x. In this modified process the probability that gene frequency change terminates at x, starting from p, is

$u_x(p) = \int_0^p G(z)\,dz \Big/ \int_0^x G(z)\,dz.$ (5.58)

Then, the average number of generations at which the gene frequency reaches x for the first time is

$\bar t_x(p) = \int_0^\infty t\,u(p, x; t)\,dt \Big/ u_x(p),$ (5.59)

where u(p, x; t) is the probability density that the gene frequency changes from p to x during t generations in the modified process. Therefore, the average first arrival time to gene frequency x can be obtained in the same way as that for the mean fixation time. Namely,

$\bar t_x(p) = \int_p^x \psi_x(z)\,u_x(z)\{1 - u_x(z)\}\,dz + \dfrac{1 - u_x(p)}{u_x(p)}\int_0^p \psi_x(z)\,u_x^2(z)\,dz,$ (5.60)

where

$\psi_x(z) = 2\int_0^x G(y)\,dy \Big/ \{V_{\delta z}\,G(z)\}$

(Kimura and Ohta, 1973c).

In the case of neutral mutations $\bar t_x(p)$ for p = 1/(2N) is

$\bar t_x(1/2N) = 4N_e\left[\{(1 - x)/x\}\log_e(1 - x) + 1\right].$ (5.61)

If x is small, $\bar t_x(1/2N) \approx 2N_e x$, since $\log_e(1 - x) \approx -x - x^2/2$. Thus, when Ne is large, $\bar t_x(1/2N)$ is quite large even for a rather small value of x.

The  average  age  of  a mutant  gene  has  also  been  studied  by  Kimura  and 
Ohta  (1973c)  and  Maruyama  (1974b).  The  determination  of  this  quantity  is 
somewhat  complicated.  Particularly  if  we  take  into  account  the  possibility 
that  the  gene  frequency  can  reach  1 (fixation)  and  then  decline  due  to  new 
mutations,  the  mathematical  formula  is  no  longer  simple.  At  the  codon  or 
nucleotide  level,  however,  this  possibility  may  be  neglected,  and  the  average 
age  of  a neutral  mutation  is  given  by 

$\bar t(1/2N, x) = -\{4N_e x/(1 - x)\}\log_e x.$ (5.62)

The average age is always larger than the average first arrival time, as it should be. For example, if Ne = 10^6 and x = 0.1, then $\bar t(1/2N, x) \approx 10^6$ generations from (5.62), while $\bar t_x(1/2N) \approx 2 \times 10^5$ generations from (5.61). These computations suggest that many polymorphic genes existing in the present natural populations have an extremely long history. In some organisms such as man 10^6 generations is longer than the history of the species itself.
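
The two neutral-case quantities can be evaluated side by side. The sketch below codes the average first arrival time 4Ne[{(1 − x)/x} log_e(1 − x) + 1] of (5.61) and the average age −{4Ne x/(1 − x)} log_e x of (5.62); the function names are illustrative.

```python
import math


def neutral_first_arrival_time(x, N_e):
    """Formula (5.61): mean first arrival time at frequency x, starting from
    a single new neutral mutant (p -> 0), in generations."""
    return 4.0 * N_e * ((1.0 - x) / x * math.log1p(-x) + 1.0)


def neutral_age(x, N_e):
    """Formula (5.62): mean age of a neutral mutant now at frequency x."""
    return -4.0 * N_e * x / (1.0 - x) * math.log(x)


if __name__ == "__main__":
    N_e, x = 1e6, 0.1
    print(neutral_first_arrival_time(x, N_e))
    print(neutral_age(x, N_e))
```

For Ne = 10^6 and x = 0.1 the age comes out near 10^6 generations and the first arrival time near 2 × 10^5 generations, so the age exceeds the first arrival time, as it must.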


5.4 Stationary distribution of gene frequencies

5.4.1 General formula

In sections 5.1 and 5.2 we have seen that random genetic drift acts to reduce the genetic variability of a population. In nature this reduction in genetic variability is counteracted by mutation and migration. Selection acts either to reduce or to retain the genetic variability, depending on whether it is directional or balancing. If the three different evolutionary forces, genetic drift, mutation-migration, and selection, act together in a population, it is expected that their effects are eventually balanced with each other and the gene frequency distribution reaches some stable form. As a concrete example, consider a completely recessive deleterious gene A1 at a locus, and assume that the same type of allele repeatedly arises by mutation from its normal allele with a frequency of u per generation. All the deleterious mutations need not be the same at the codon or nucleotide level. If they have the same phenotypic effect, they can be lumped together and handled as the same allele, as mentioned earlier. Under this assumption, the effects of mutation and selection will be balanced at the gene frequency (x) of A1 equal to $\sqrt{u/s}$ if the population size is infinitely large and the fitness of A1A1 is reduced by s. In finite populations, however, genetic drift tends to spread the gene frequency distribution in every generation, so that x reaches some stable distribution.

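Mathematically, such a stable distribution can be obtained by using the formula for the probability flux (5.22). It is clear that at equilibrium the gene frequency distribution φ(p, x; t) will have a stable form and be independent of p and t. At this stage, P(x, t) is clearly 0 at every point of x between 0 and 1. Thus,

$M_{\delta x}\,\phi(x) - \dfrac{1}{2}\,\dfrac{d}{dx}\{V_{\delta x}\,\phi(x)\} = 0.$

Therefore,

$\dfrac{d}{dx}\log_e\{V_{\delta x}\,\phi(x)\} = \dfrac{2M_{\delta x}}{V_{\delta x}}.$

Integrating both sides of this expression, we have

$\log_e\{V_{\delta x}\,\phi(x)\} = \mathrm{const.} + 2\int \dfrac{M_{\delta x}}{V_{\delta x}}\,dx$

or

$\phi(x) = \dfrac{C}{V_{\delta x}}\exp\left\{2\int \dfrac{M_{\delta x}}{V_{\delta x}}\,dx\right\},$ (5.63)

where C is a constant such that $\int_0^1 \phi(x)\,dx = 1$.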

This general formula was first derived by Wright (1938b), using a different method. Previously, Wright (1931, 1937) had studied the distributions of gene frequencies in various special cases which are biologically important. Let us now consider some special cases in the following.
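
Before turning to the special cases, it may help to see how the general formula φ(x) = (C/V) exp{2∫(M/V) dx} can be evaluated on a grid for any given mean change M(x) and variance V(x). The sketch below is an illustration under stated assumptions, not a method from the original: it uses a midpoint grid and trapezoidal integration, and it is reliable only when the density vanishes at both boundaries. The example parameters (a migration-type M with rate 0.05 toward frequency 0.3, and Ne = 100) are hypothetical.

```python
import math


def stationary_density(M, V, n=4000):
    """Evaluate phi(x) = (C / V) * exp(2 * integral of M/V) on a midpoint grid.

    M, V : functions giving the mean and variance of the change in gene
           frequency per generation.
    Returns (xs, phi), with phi normalised so that its integral is 1.
    """
    h = 1.0 / n
    xs = [(i + 0.5) * h for i in range(n)]
    f = [2.0 * M(x) / V(x) for x in xs]
    # cumulative trapezoidal integral of 2M/V, measured from the first grid point
    I = [0.0]
    for i in range(1, n):
        I.append(I[-1] + 0.5 * h * (f[i - 1] + f[i]))
    top = max(I)                         # subtract the maximum to avoid overflow
    w = [math.exp(v - top) / V(x) for v, x in zip(I, xs)]
    C = 1.0 / (h * sum(w))               # normalising constant
    return xs, [C * wi for wi in w]


if __name__ == "__main__":
    N_e, m, x_I = 100.0, 0.05, 0.3       # hypothetical values for illustration
    xs, phi = stationary_density(lambda x: m * (x_I - x),
                                 lambda x: x * (1.0 - x) / (2.0 * N_e))
    h = xs[1] - xs[0]
    print(h * sum(x * p for x, p in zip(xs, phi)))   # mean gene frequency
```

For this choice of M and V the numerical mean is very close to 0.3, the frequency toward which the mean change pulls.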

5.4.2  Neutral  genes  with  migration 

Consider a large number of partially isolated populations, each of which exchanges genes with a nearby large population at a rate of m per generation. We assume that the size of the large population is so large that the gene frequency (x_I) of A1 in this population remains constant over generations. This type of model is called the island model (Wright, 1943). Let x be the gene frequency of A1 in a partially isolated population. The mean change of x per generation is then given by


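$M_{\delta x} = m(x_I - x) = -m(1 - x_I)\,x + m\,x_I(1 - x),$ (5.64)

while the variance is $V_{\delta x} = x(1 - x)/(2N_e)$. Therefore,

$2\int \dfrac{M_{\delta x}}{V_{\delta x}}\,dx = 4N_e m\int\left\{-\dfrac{1 - x_I}{1 - x} + \dfrac{x_I}{x}\right\}dx = 4N_e m\{(1 - x_I)\log_e(1 - x) + x_I\log_e x\} + \mathrm{const.},$

and thus,

$\phi(x) = C\,x^{4N_e m x_I - 1}(1 - x)^{4N_e m(1 - x_I) - 1}.$ (5.65)

Since

$\int_0^1 \phi(x)\,dx = C\int_0^1 x^{4N_e m x_I - 1}(1 - x)^{4N_e m(1 - x_I) - 1}\,dx = C\,B(4N_e m x_I,\ 4N_e m(1 - x_I)) = 1,$

$C = \dfrac{1}{B(4N_e m x_I,\ 4N_e m(1 - x_I))} = \dfrac{\Gamma(4N_e m)}{\Gamma(4N_e m x_I)\,\Gamma(4N_e m(1 - x_I))},$

where B(·,·) and Γ(·) are the beta and gamma functions, respectively.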

The distribution (5.65) is known as the beta distribution in statistics. In the case of x_I = 0.5 it is U-shaped if 2Nem < 1, while if 2Nem > 1, it is bell-shaped. If 2Nem = 1 exactly, it is a uniform distribution. The mean ($\bar x$) and variance ($V_x$) of gene frequencies are given by

$\bar x = \int_0^1 x\,\phi(x)\,dx = x_I,$ (5.66)

$V_x = \int_0^1 (x - \bar x)^2\,\phi(x)\,dx = \dfrac{x_I(1 - x_I)}{4N_e m + 1}.$ (5.67)

The fixation index is given by

$F_{ST} = V_x/[\bar x(1 - \bar x)] = 1/(4N_e m + 1).$ (5.68)

Therefore, the degree of differentiation of gene frequencies among populations becomes high when the product of effective population size and migration rate is small. On the other hand, the average heterozygosity within populations becomes

$H = 2\int_0^1 x(1 - x)\,\phi(x)\,dx = 2x_I(1 - x_I)(1 - F_{ST}).$ (5.69)
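
The island-model results are compact enough to code directly. The sketch below assumes the beta density with exponents 4Ne·m·x_I − 1 and 4Ne·m·(1 − x_I) − 1, the fixation index 1/(4Ne·m + 1), and the heterozygosity 2x_I(1 − x_I)(1 − F_ST), where x_I denotes the gene frequency in the large source population. The function names are illustrative.

```python
import math


def island_beta_density(x, N_e, m, x_I):
    """Stationary density of gene frequency x under the island model (5.65)."""
    a = 4.0 * N_e * m * x_I
    b = 4.0 * N_e * m * (1.0 - x_I)
    # log of the normalising constant Gamma(a + b) / (Gamma(a) Gamma(b))
    log_C = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_C + (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x))


def fixation_index(N_e, m):
    """Formula (5.68): F_ST = 1 / (4 N_e m + 1)."""
    return 1.0 / (4.0 * N_e * m + 1.0)


def mean_heterozygosity(N_e, m, x_I):
    """Formula (5.69): H = 2 x_I (1 - x_I) (1 - F_ST)."""
    return 2.0 * x_I * (1.0 - x_I) * (1.0 - fixation_index(N_e, m))


if __name__ == "__main__":
    # 2 N_e m = 1 with x_I = 0.5: the density is uniform (equal to 1 everywhere)
    print(island_beta_density(0.2, 10.0, 0.05, 0.5))
    # 2 N_e m < 1: U-shaped, so the density near 0 exceeds that at 0.5
    print(island_beta_density(0.01, 10.0, 0.01, 0.5),
          island_beta_density(0.5, 10.0, 0.01, 0.5))
    print(fixation_index(100.0, 0.05), mean_heterozygosity(100.0, 0.05, 0.3))
```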


Nei  and  Imaizumi  (1966a)  studied  the  variances  (and  also  the  covariances) 
of  the  ABO  blood  group  gene  frequencies  among  small  isolated  (mostly 
island)  populations  in  Japan.  It  is  believed  that  a small  amount  of  migration 
has  occurred  between  these  so-called  isolated  populations  and  the  general 
Japanese  population  for  many  generations.  Their  estimate  of  FST  was 
0.00191,  which  was  significantly  different  from  0.  From  the  demographic 
data  of  these  populations,  the  average  effective  size  of  the  populations  was 
estimated  to  be  1993.  Therefore,  the  migration  rate  (m)  can  be  estimated 
from  the  following  equation,  if  we  assume  that  the  stationary  distribution 
has  been  reached. 


$\dfrac{1}{4 \times 1993 \times m + 1} = 0.00191.$


It  becomes  0.06.  Thus,  a substantial  amount  of  migration  must  have  occurred 
between  the  isolated  populations  and  the  general  Japanese  population. 
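
Solving F_ST = 1/(4Ne·m + 1) for m gives m = (1/F_ST − 1)/(4Ne). A minimal sketch of this inversion follows; the function name is illustrative, and the stationary distribution is assumed to have been reached.

```python
def migration_rate_from_fst(F_st, N_e):
    """Invert F_ST = 1 / (4 N_e m + 1) to obtain the migration rate m."""
    return (1.0 / F_st - 1.0) / (4.0 * N_e)


if __name__ == "__main__":
    # Values quoted for the isolated Japanese populations
    print(migration_rate_from_fst(0.00191, 1993))
```

With the rounded inputs quoted above this gives a value between 0.06 and 0.07, of the same order as the estimate reported in the text.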
Wright's (1931, 1943) original island model was to describe the genetic structure of a population which is subdivided into many subpopulations. He equated x_I to the mean gene frequency of the whole population. If the size of the total population is very large and mutation occurs reversibly between A1 and A2, then the assumption of constancy of x_I is satisfied. In practice, however, population size is not always large, and, furthermore, according to the molecular structure of the gene, the forward-backward mutation between two alleles is extremely rare. This seriously damages the assumption of constancy of x_I (see section 5.5). Strictly speaking, this is also true with the model described in the foregoing paragraph, but in this case the approximate constancy of x_I would be maintained for a certain period of time and if migration rate is sufficiently large, the equilibrium distribution would be reached rather quickly.

Another problem which arises in applying the island model to a subdivided population is that it does not take into account the possible relationship between migration rate and geographic distance. More realistic models of population structure in which this relationship is taken into account have been studied by Malecot (1948, 1950, 1967, 1969) and Kimura and Weiss (1964).

5.4.3  Mutation  and  selection 


Following Wright (1937), we first assume that mutations occur from A2 to A1 with a rate of u per generation and from A1 to A2 with a rate of v. Let x be the frequency of A1 and 1 - s, 1 - h, and 1 be the fitnesses of A1A1, A1A2, and A2A2, respectively. (Theoretically, h and s can take negative values.) Then,

$M_{\delta x} = -vx + u(1 - x) - x(1 - x)\{h + (s - 2h)x\}$

and $V_{\delta x}$ is the same as before. Therefore,

$2\int \dfrac{M_{\delta x}}{V_{\delta x}}\,dx = 4N_e\{v\log_e(1 - x) + u\log_e x\} - 4N_e\left\{hx + \tfrac{1}{2}(s - 2h)x^2\right\}.$

Hence,

$\phi(x) = C\,e^{-4N_e\{hx + (s - 2h)x^2/2\}}\,x^{4N_e u - 1}(1 - x)^{4N_e v - 1}.$ (5.70)
It is noted that if there is no selection, h = s = 0, so that (5.70) becomes

$\phi(x) = \dfrac{\Gamma\{4N_e(u + v)\}}{\Gamma(4N_e u)\,\Gamma(4N_e v)}\,x^{4N_e u - 1}(1 - x)^{4N_e v - 1}.$ (5.71)

In  the  past  (5.70)  and  (5.71)  were  widely  used  in  the  literature.  However, 
simply  because  the  forward-backward  type  of  mutation  between  two  alleles 
rarely  occurs  at  the  molecular  level,  the  general  applicability  of  the  formulae 
is  questionable.  The  only  situation  to  which  (5.70)  may  be  applied  is  the 
case  where  the  same  type  of  deleterious  mutations  occur  repeatedly  at  a 
locus,  as  discussed  earlier.  Let  us  now  consider  this  special  case  in  some 
detail,  since  such  mutations  seem  to  be  quite  common.  For  example,  in 
Drosophila lethal mutations occur at a rate of approximately 10^-5 per locus
per  generation.  Many  genetic  diseases  in  man  are  also  apparently  due  to  this 
type  of  mutation. 

In man there are many dominant genetic diseases which reduce the fitness of heterozygotes considerably. Achondroplasia is a good example. The frequency of this mutant gene is so low that virtually no homozygotes appear in the population. Theoretically, if 4Neh ≫ 1, the selection against the mutant genes occurs mostly through heterozygotes, and virtually no homozygotes appear. In this case, therefore, the x² term of the exponent of e in (5.70) may be neglected. Also, since A1 is a deleterious gene and the frequency x is very small, the backward mutation may be neglected. Therefore, noting that $(1 - x)^{-1} \approx 1$ when x is small, we obtain the following approximate formula.

$\phi(x) = \dfrac{(4N_e h)^{4N_e u}}{\Gamma(4N_e u)}\,e^{-4N_e h x}\,x^{4N_e u - 1}.$ (5.72)

This type of distribution is called the gamma distribution in statistics, and the mean and the variance are approximately given by

$\bar x = u/h$ (5.73)

and

$V_x = u/(4N_e h^2),$ (5.74)

respectively.

In Drosophila a large number of experiments have been conducted on the mechanism of maintenance of lethal genes. In these experiments the quantity observed is not the frequency of lethal genes at a locus but the frequency of lethal bearing chromosomes. Let Q be the proportion of chromosomes carrying one or more lethal genes. If we assume independent distribution of lethal genes at different loci,

$1 - Q = \prod_{i=1}^{r}(1 - x_i) \approx e^{-\sum_i x_i},$

where $x_i$ is the frequency of the lethal gene at the i-th locus and r is the total number of lethal loci. Thus, $Q_1 = -\log_e(1 - Q) = \sum_i x_i$. Since a sum of gamma variates is again distributed as a gamma variate, the distribution of $Q_1$ is given by

$\phi(Q_1) = \dfrac{(4N_e h)^{4N_e U}}{\Gamma(4N_e U)}\,e^{-4N_e h Q_1}\,Q_1^{4N_e U - 1},$ (5.75)

where $U = \sum_i u_i$, in which $u_i$ is the mutation rate at the i-th locus (Nei, 1968). The mean ($\bar Q_1$) and variance ($V_{Q_1}$) of $Q_1$ are approximately given by

$\bar Q_1 = U/h,$ (5.76)

$V_{Q_1} = U/(4N_e h^2).$ (5.77)

Murata (1970) maintained 51 small populations of Drosophila melanogaster
and  examined  the  frequency  of  lethal  chromosomes  in  each  population  during 
the  62nd  to  72nd  generations.  Each  population  consisted  of  25  males  and 
25  females,  and  the  test  was  made  only  for  the  second  chromosome.  The 
frequency  distribution  of  lethal  chromosomes  obtained  is  given  in  fig.  5.6 
together  with  the  theoretical  curve  given  by  (5.75).  The  fit  of  the  theoretical 
curve to the data seems to be satisfactory. The mean and variance of $Q_1$
are  0.115  and  0.01503,  respectively.  After  making  a small  correction  for  the 
sampling  variance,  the  heterozygous  effect  of  lethal  genes  and  the  mutation 
rate  per  chromosome  can  be  estimated  by  using  (5.76)  and  (5.77),  assuming 
Ne  = 50.  They  become  0.038  and  0.0044,  respectively.  Thus,  lethal  genes 
appear  to  reduce  the  fitness  of  heterozygotes  by  about  4 percent  on  the 
average.  It  is  noted  that  the  estimate  of  the  lethal  mutations  is  very  close  to 
the  generally  accepted  value,  0.005,  for  this  chromosome  (Crow  and  Temin, 
1964). 
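The estimation step above can be sketched as a method-of-moments inversion of (5.76) and (5.77): dividing the mean by $4N_e$ times the variance gives h, and multiplying h by the mean gives U. This is a minimal sketch only. The function name `estimate_h_and_U` is hypothetical, and the small correction for sampling variance mentioned in the text is not applied here because its form is not given.

```python
def estimate_h_and_U(mean_q1, var_q1, n_e):
    """Invert mean = U/h and variance = U/(4*N_e*h**2) for h and U.

    Assumption: no correction for sampling variance is made, unlike
    the estimate quoted in the text.
    """
    h = mean_q1 / (4.0 * n_e * var_q1)   # mean/variance = 4*N_e*h
    U = h * mean_q1                      # U = h * mean
    return h, U


# Values reported for Murata's (1970) populations, with N_e = 50.
h, U = estimate_h_and_U(mean_q1=0.115, var_q1=0.01503, n_e=50)
print(round(h, 3), round(U, 4))
```

Even without the correction, the uncorrected moments give values close to the 0.038 and 0.0044 quoted above.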

Fig. 5.6. Observed and expected frequency distributions of lethal second chromosomes in
small populations of Drosophila melanogaster. The theoretical curve is given by
$51 \times 7.7 \times e^{-7.7Q_1} \times 0.05 = 19.64 \times e^{-7.7Q_1}$, in which $4N_e U$ is assumed to be 1. From Murata
(1970).

Some deleterious genes are apparently completely recessive. In this case
(5.70) can be approximated by

$$\phi(x) = \frac{2(2N_e s)^{2N_e u}}{\Gamma(2N_e u)}\, e^{-2N_e s x^2}\, x^{4N_e u - 1}, \tag{5.78}$$

where s > 0.5 is assumed (Wright, 1937; Nei, 1968). This is somewhat
similar to the gamma distribution. When $4N_e u < 1$, the distribution becomes
inverted J-shaped, and $\phi(x)$ increases as $x \to 0$. The frequency of lethal genes
varies  considerably  even  in  moderately  large  populations.  The  probability 
that  no  lethal  genes  exist  in  the  population  is  given  by 

$$f(0) = \int_0^{1/2N} \phi(x)\,dx \approx \frac{\phi(1/2N)}{2N \cdot 4N_e u} \tag{5.79}$$

approximately (Wright, 1931; Kimura, 1968b). If $N_e = N$, this probability
is 15 percent for $N = 10^4$, 87 percent for $N_e = 10^3$, and 99 percent for
$N_e = 100$ (Wright, 1969).
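The percentages quoted above can be checked numerically. The sketch below assumes the completely recessive density has the form $\phi(x) \propto e^{-2N_e s x^2} x^{4N_e u - 1}$ normalized over $(0, \infty)$, that the exponential factor is effectively 1 on $0 < x < 1/(2N)$, and that $u = 10^{-5}$ and $s = 1$ (lethal genes). These parameter values and the function name `prob_no_lethal` are assumptions made for illustration, not values stated at this point in the text.

```python
import math


def prob_no_lethal(n, u=1e-5, s=1.0):
    """Approximate f(0), the integral of the density over 0 < x < 1/(2N).

    Assumes N_e = N. Closed form used (an assumption derived from the
    density above):
        f(0) ~ (2Ns)**(2Nu) * (2N)**(-4Nu) / Gamma(2Nu + 1)
    """
    shape = 2.0 * n * u
    log_f0 = (shape * math.log(2.0 * n * s)
              - 2.0 * shape * math.log(2.0 * n)
              - math.lgamma(shape + 1.0))
    return math.exp(log_f0)


for n in (10**4, 10**3, 10**2):
    print(n, round(prob_no_lethal(n), 2))
```

Working with logarithms and `math.lgamma` avoids evaluating the gamma function near its pole when $2Nu$ is very small.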

The mean of distribution (5.78) is given by

$$\bar{x} = \frac{\Gamma(2N_e u + 1/2)}{\sqrt{2N_e s}\;\Gamma(2N_e u)}. \tag{5.80}$$

This becomes $\sqrt{u/s}$ if $N_e \to \infty$, and agrees with the result of the deterministic
approach. On the other hand, if $N_e u < 0.01$,

$$\bar{x} = u\sqrt{2\pi N_e/s} \tag{5.81}$$


approximately. Fig. 5.7 shows the relationship between $\bar{x}$ and $N_e$ given by
(5.80). In this figure the same relationships for partially recessive and
overdominant lethals are also included. These relationships were obtained by
(5.73) and numerical integrations of (5.70). It is seen that in the case of
completely recessive lethals the mean gene frequency in small populations
is considerably smaller than the value of $\sqrt{u/s} = 0.0033$; for the mean gene
frequency to become close to $\sqrt{u/s}$, population size must be of the order of
$10^6$. This is also true with overdominant lethals. On the other hand, the
frequency of partially recessive lethals is independent of population size
except in very small populations.
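The dependence of the mean frequency of completely recessive lethals on population size can be tabulated directly from the gamma-function ratio in (5.80). The following is a sketch under the assumptions $u = 10^{-5}$ and $s = 1$ (the values used for Fig. 5.7); the function names are hypothetical.

```python
import math


def mean_recessive_lethal(n_e, u=1e-5, s=1.0):
    """Mean frequency: Gamma(2*N_e*u + 1/2) / (sqrt(2*N_e*s) * Gamma(2*N_e*u))."""
    z = 2.0 * n_e * u
    # Ratio of gamma functions computed in log space for stability.
    ratio = math.exp(math.lgamma(z + 0.5) - math.lgamma(z))
    return ratio / math.sqrt(2.0 * n_e * s)


def small_population_approx(n_e, u=1e-5, s=1.0):
    """Approximation u * sqrt(2*pi*N_e/s), intended for N_e*u < 0.01."""
    return u * math.sqrt(2.0 * math.pi * n_e / s)


deterministic = math.sqrt(1e-5 / 1.0)  # sqrt(u/s)
for n_e in (1e2, 1e3, 1e4, 1e5, 1e6, 1e8):
    print(f"N_e={n_e:.0e}  mean={mean_recessive_lethal(n_e):.6f}  "
          f"sqrt(u/s)={deterministic:.6f}")
```

The printed means rise toward $\sqrt{u/s}$ only for very large $N_e$, which is the pattern described in the paragraph above.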


Fig. 5.7. Mean frequencies of lethal genes in equilibrium populations. For overdominant
lethals $s_1 = 1.00$ and $s_2 = 0.01$ are assumed, while the value of h for partially recessive
lethals is 0.03. The mutation rate is assumed to be $10^{-5}$ for all three kinds of lethals.
From Nei (1969b).

5.4.4 Neutral mutations


As  noted  earlier,  there  are  a large  number  of  possible  alleles  at  a locus  at  the 
nucleotide or codon level. Following Kimura (1968b), let us assume that
there are k possible alleles at a locus and each allele mutates with a frequency
of $v/(k - 1)$ to one of k - 1 remaining alleles, so that v is the mutation rate
per gene per generation. Denote by x the frequency of a particular allele in a
population. On the assumption that all alleles are selectively neutral, the
mean change of gene frequency per generation is given by

$$M_{\delta x} = -vx + (1 - x)v_1, \tag{5.82}$$

where $v_1 = v/(k - 1)$. Therefore, the stationary distribution of gene
frequency x may be expressed by (5.71), replacing u by $v_1$. Namely,


$$\phi(x) = \frac{\Gamma(M + M')}{\Gamma(M)\Gamma(M')}\,(1 - x)^{M-1} x^{M'-1}, \tag{5.83}$$

where $M = 4N_e v$ and $M' = M/(k - 1)$. Clearly, the mean of x is

$$E(x) = \bar{x} = 1/k. \tag{5.84}$$


Since  the  total  number  of  possible  alleles  is  k and  each  allele  behaves 
independently  in  the  same  way,  the  expected  number  of  alleles  whose 
frequency is from x to x + dx is given by $k\phi(x)dx$. In practice k is very large,
so that the distribution of the expected number of alleles is given by

$$\Phi(x) = \lim_{k \to \infty} \frac{k\,\Gamma(M + M')}{\Gamma(M)\Gamma(M')}\,(1 - x)^{M-1} x^{M'-1} = M(1 - x)^{M-1} x^{-1} \tag{5.85}$$

approximately. Note that $\Gamma(M') \to 1/M'$ as $M' \to 0$. This formula was first
derived by Kimura and Crow (1964).

As mentioned earlier, the homozygosity at a locus is given by $\sum_i x_i^2$, where
$x_i$ is the frequency of the i-th allele. The expectation of homozygosity is

$$J = E\Big(\sum_i x_i^2\Big) = \int_0^1 x^2 M(1 - x)^{M-1} x^{-1}\,dx = \frac{1}{M + 1}. \tag{5.86}$$

Therefore, the expected heterozygosity is

$$H = 1 - J = M/(M + 1). \tag{5.87}$$

As expected, H is large when $4N_e v$ is large.

The  average  number  of  alleles  per  locus  is  equal  to  the  reciprocal  of  the 
mean  frequency  of  alleles  existing  in  the  population  (Wright,  1948b;  Ewens, 
1964;  Kimura,  1968b).  Clearly, 

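$$\mathrm{Mean}[x \mid x \neq 0] = \bar{x}/\{1 - f(0)\}, \tag{5.88}$$

where $f(0) = \int_0^{1/2N} \phi(x)\,dx$ as in (5.79). Since $\bar{x} = 1/k$, the average number of
alleles is

$$n_a = \lim_{k \to \infty} k\{1 - f(0)\} = \int_{1/2N}^{1} M(1 - x)^{M-1} x^{-1}\,dx. \tag{5.89}$$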

Ewens  (1972)  has  shown  that  if  n alleles  are  sampled  at  random  from  this 
population,  the  expected  number  of  alleles  in  the  sample  is  given  by 


$$\frac{M}{M} + \frac{M}{M + 1} + \frac{M}{M + 2} + \cdots + \frac{M}{M + n - 1}. \tag{5.90}$$

Note that $n_a$ is different from the effective number of alleles defined by
Kimura and Crow (1964), i.e.

$$n_e = 1/E\Big(\sum_i x_i^2\Big) = M + 1. \tag{5.91}$$

The effective number is equal to the actual number ($n_a$) only when all allele
frequencies are the same. Otherwise, the former is smaller than the latter.
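The neutral-allele quantities above are simple functions of $M = 4N_e v$, so they are easy to tabulate. The sketch below assumes the forms $J = 1/(M + 1)$, $H = M/(M + 1)$, $n_e = M + 1$, and Ewens's sum $\sum_{i=0}^{n-1} M/(M + i)$ for the expected number of alleles in a sample of n genes; the function names are hypothetical.

```python
def expected_heterozygosity(M):
    """H = M / (M + 1), where M = 4 * N_e * v."""
    return M / (M + 1.0)


def effective_number_of_alleles(M):
    """n_e = 1 / E(sum x_i^2) = M + 1 (Kimura and Crow, 1964)."""
    return M + 1.0


def ewens_expected_alleles(M, n):
    """Expected number of alleles in a random sample of n genes:
    M/M + M/(M+1) + ... + M/(M+n-1)."""
    return sum(M / (M + i) for i in range(n))


for M in (0.01, 0.1, 1.0):
    print(M,
          round(expected_heterozygosity(M), 4),
          effective_number_of_alleles(M),
          round(ewens_expected_alleles(M, 100), 3))
```

Comparing the last two columns illustrates the point made above: the expected number of alleles in a sample exceeds the effective number, because most alleles are rare.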

Another  parameter  which  is  often  useful  is  the  proportion  of  polymorphic 
loci.  We  define  a locus  as  polymorphic  if  the  frequency  of  the  commonest 
allele  is  equal  to  or  less  than  1 - q,  where  q is  a small  quantity.  The  most 
commonly  used  value  of  q is  0.01.  If  all  loci  have  the  same  mutation  rate, 
then  the  expected  proportion  of  polymorphic  loci  may  be  obtained  by 

$$P = 1 - \lim_{k \to \infty} k \int_{1-q}^{1} \phi(x)\,dx = 1 - q^M \tag{5.92}$$

(Kimura, 1971). In many organisms M is about 0.1. If we use q = 0.01, then
P = 0.37. This roughly agrees with the actual observations (ch. 6).

5.4.5 Distribution under irreversible mutation

Natural  populations  often  contain  many  alleles  at  a locus  (cistron).  Thus, 
if  we  consider  mutations  at  the  level  of  cistron,  the  theory  in  the  foregoing 
subsection  is  appropriate.  However,  at  the  codon  or  nucleotide  level  the 
mutation  rate  is  so  low,  that  a population  is  almost  always  monomorphic  or 
polymorphic  just  for  two  types,  i.e.,  the  mutant  type  (A,)  and  original  type 
(A,).  Reversible  mutation  is  virtually  negligible  while  they  are  polymorphic. 
Namely,  the  two-allele  theory  with  irreversible  mutation  applies.  In  this 
case  every  codon  may  mutate  independently  and  the  mutant  type  may 
increase or decrease in frequency. At equilibrium when the effects of mutation,
selection, and genetic drift are balanced, it is expected that the frequency
of  mutant  codons  reaches  some  form  of  stable  distribution.  We  shall  now 
study  this  distribution  together  with  such  a quantity  as  the  expected  number 
of  heterozygous  codons  per  locus.  We  shall  follow  Kimura's  (1969a)  method, 
assuming  that  in  populations  each  codon  behaves  independently,  though  this 
is  not  necessarily  true  for  closely  linked  codons. 

Let $\mu$ be the mutation rate per codon per generation. Thus, if there are n
codons at a locus, the total number of mutant codons arising in each
generation is $2Nn\mu = 2Nv$. We have defined $\phi(p, x; t)$ as the probability
density that the gene frequency becomes x at time t, given that it is p at
time 0. We now consider the distribution, $\Phi(p, x)$, of the expected number
of mutant codons whose frequency is x at equilibrium. Since 2Nv mutations
occur every generation, we have

$$\Phi(p, x) = 2Nv \int_0^{\infty} \phi(p, x; t)\,dt, \tag{5.93}$$

where p is the initial frequency of mutant codons. Therefore, the expectation
of an arbitrary function of gene frequency, f(x), is given by

$$F(p) = \int_0^1 f(x)\Phi(p, x)\,dx, \tag{5.94}$$

where the integral is over the open interval (0,1), since we are considering
only the polymorphic codons [$x = 1/(2N) \sim (2N - 1)/(2N)$]. An important
parameter is the expected number of heterozygous codons per locus. In this
case f(x) = 2x(1 - x).




The solution for F(p) can be obtained by a method similar to that for the
average fixation time (Kimura, 1969a). The result is given by

$$F(p) = \{1 - u(p)\} \int_0^p \psi_f(z)u(z)\,dz + u(p) \int_p^1 \psi_f(z)\{1 - u(z)\}\,dz, \tag{5.95}$$

where u(p) is the probability of ultimate fixation given by (5.34) and

$$\psi_f(z) = \frac{4Nv f(z)}{V_{\delta z}G(z)} \int_0^1 G(x)\,dx. \tag{5.96}$$

The expected number of heterozygous codons (H(p)) can be computed by
putting f(x) = 2x(1 - x). In the case of no selection
$G(x) = \exp\{-2\int (M_{\delta x}/V_{\delta x})\,dx\} = 1$, so that $\int_0^1 G(x)\,dx = 1$. With
$V_{\delta z} = z(1 - z)/(2N)$, assuming $N_e = N$, this gives $\psi_f(z) = 16N^2 v$. We also know that
u(p) = p, where p = 1/(2N) in the present case. Therefore,

$$H(1/2N) = 8N^2 v p(1 - p) \approx 4Nv. \tag{5.97}$$

If the mutant is advantageous without dominance (genotype fitnesses 1, 1 + s, and
1 + 2s) and $4Ns \gg 1$, it can be shown that

$$H(1/2N) \approx 8Nv \tag{5.98}$$

approximately.  Therefore,  advantageous  genes  contribute  to  heterozygosity 
twice  as  much  as  neutral  genes,  if  mutation  rate  is  the  same.  In  practice, 
however,  the  rate  of  advantageous  mutations  is  likely  to  be  much  smaller 
than  the  rate  of  neutral  mutations  (ch.  6). 
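The neutral result can be written out step by step. The derivation below is a sketch that assumes $G(x) = 1$, $u(z) = z$, $V_{\delta z} = z(1 - z)/(2N)$ with $N_e = N$, and $f(z) = 2z(1 - z)$, so that the weight function in (5.95) is the constant $\psi_f(z) = 16N^2 v$.

```latex
\begin{align*}
H(p) &= (1 - p)\int_0^p 16N^2 v\, z \,dz \;+\; p\int_p^1 16N^2 v\,(1 - z)\,dz \\
     &= 16N^2 v\left[(1 - p)\,\frac{p^2}{2} + p\,\frac{(1 - p)^2}{2}\right] \\
     &= 8N^2 v\, p(1 - p)\,\{p + (1 - p)\} \\
     &= 8N^2 v\, p(1 - p).
\end{align*}
% Setting p = 1/(2N):
%   H(1/2N) = 4Nv \{1 - 1/(2N)\} \approx 4Nv.
```

With $p = 1/(2N)$ the factor $1 - p$ is very close to 1 for any realistic N, which is why the heterozygosity reduces to approximately 4Nv.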

Formula  (5.95)  can  be  used  for  computing  any  function  of  x.  Using  this 
formula,  Kimura  has  studied  the  variance  of  the  number  of  heterozygous 
codons  and  the  number  of  segregating  codons.  It  can  also  be  used  for 
deriving the distribution function $\Phi(p, x)$ itself. In this case we put $f(x) =
\delta(x - y)$, where $\delta(\cdot)$ is the Dirac delta function, so that $\int_0^1 f(x)\delta(x - y)\,dx =
f(y)$. Therefore,

$$\psi_\delta(z) = \frac{4Nv\,\delta(z - y)}{V_{\delta z}G(z)} \int_0^1 G(x)\,dx$$

and, if we note p = 1/(2N) and $1/(2N) \le y \le 1 - 1/(2N)$, then the first
integral of (5.95) vanishes since $\delta(z - y) = 0$. Therefore, the distribution is
given by

$$\Phi(p, y) = 4Nv\,u(p)\{1 - u(y)\} \int_0^1 G(x)\,dx \Big/ \{V_{\delta y}G(y)\}. \tag{5.99}$$

Noting that $u(1/2N) = (1/2N)\big/\int_0^1 G(x)\,dx$ approximately, and using x instead
of y for representing the gene frequency, the above formula reduces to

$$\Phi(x) = \frac{2v}{V_{\delta x}G(x)} \int_x^1 G(z)\,dz \Big/ \int_0^1 G(z)\,dz. \tag{5.100}$$

The  above  formula  is  due  to  Kimura  (1964,  1969a),  but  equivalent  formulae 
for  special  cases  had  been  obtained  by  Fisher  (1930)  and  Wright  (1938b, 
1942,  1945).  Ewens  (1963b,  1969)  also  derived  a formula  equivalent  to 
(5.99)  independently. 

In  the  case  of  no  selection  (5.100)  reduces  to 


$$\Phi(x) = 4N_e v/x, \tag{5.101}$$

while for advantageous mutations with no dominance it becomes

$$\Phi(x) = \frac{4N_e v}{x(1 - x)} \cdot \frac{1 - e^{-4N_e s(1 - x)}}{1 - e^{-4N_e s}}. \tag{5.102}$$

Later, we shall use these formulae for testing the neutral mutation hypothesis.
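The two special cases can be checked against the general formula numerically. The sketch below assumes the general form $\Phi(x) = \{2v/(V_{\delta x}G(x))\}\int_x^1 G(z)\,dz / \int_0^1 G(z)\,dz$ with $V_{\delta x} = x(1 - x)/(2N_e)$ and, for genic selection, $G(x) = e^{-4N_e s x}$. The Simpson-rule integrator and all function names are illustrative choices, not part of the original treatment.

```python
import math


def simpson(f, a, b, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def phi_general(x, n_e, v, s):
    """Expected number of mutant codons at frequency x (general formula)."""
    G = lambda z: math.exp(-4.0 * n_e * s * z)   # G = 1 when s = 0
    var = x * (1.0 - x) / (2.0 * n_e)            # drift variance V_dx
    return 2.0 * v / (var * G(x)) * simpson(G, x, 1.0) / simpson(G, 0.0, 1.0)


def phi_neutral(x, n_e, v):
    return 4.0 * n_e * v / x


def phi_advantageous(x, n_e, v, s):
    S = 4.0 * n_e * s
    return (4.0 * n_e * v / (x * (1.0 - x))
            * (1.0 - math.exp(-S * (1.0 - x))) / (1.0 - math.exp(-S)))


x, n_e, v = 0.3, 100, 1e-6
print(phi_general(x, n_e, v, 0.0), phi_neutral(x, n_e, v))
print(phi_general(x, n_e, v, 0.01), phi_advantageous(x, n_e, v, 0.01))
```

Each printed pair should agree to within the integration error, which shows that the closed forms are instances of the general expression under the stated assumptions.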


5.5 Genetic differentiation of populations

5.5.1  Differentiation  with  migration 

In  section  5.4  we  studied  Wright's  island  model  without  mutation.  Let  us 
now  extend  this  model  to  the  case  of  infinite  number  of  possible  alleles 
with  mutation.  We  shall  also  remove  the  assumption  of  an  infinite  number 
of  subpopulations.  We  assume  that  there  are  s subpopulations  of  effective 
size N and immigrants into a subpopulation are a random sample of individuals
from the whole population. We denote the migration rate by m and
the mutation rate by v. Let $J_0$ be the probability of identity of two randomly
chosen genes from a subpopulation, and $J_1$ be the probability of identity of
two random genes, one from each of two subpopulations. Clearly, $J_0$ is
equal to the expected homozygosity within populations, i.e. $J_0 = E(\sum_i x_i^2)$,
where $x_i$ is the frequency of the i-th allele in a subpopulation. On the other
hand, $J_1$ is given by $E(\sum_i x_i y_i)$, where $x_i$ and $y_i$ are the frequencies of the i-th
allele in two populations. We have seen that when there is no migration
and no mutation the recurrence equation for $J_0$ is given by $J_0^{(t+1)} = 1/(2N) +
\{1 - 1/(2N)\}J_0^{(t)}$, where the superscript t refers to generation (5.13). We now
assume that sampling of genes, migration, and mutation occur in this order.
Then, following Malecot (1969) and Maruyama (1970b), we can derive the
following recurrence equations for $J_0$ and $J_1$.

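$$J_0^{(t+1)} = (1 - v)^2 \left[ a\left\{\frac{1}{2N} + \left(1 - \frac{1}{2N}\right)J_0^{(t)}\right\} + (1 - a)J_1^{(t)} \right], \tag{5.103a}$$

$$J_1^{(t+1)} = (1 - v)^2 \left[ b\left\{\frac{1}{2N} + \left(1 - \frac{1}{2N}\right)J_0^{(t)}\right\} + (1 - b)J_1^{(t)} \right], \tag{5.103b}$$

where $a = (1 - m)^2 + m(2 - m)/s$ and $b = m(2 - m)/s$.

It is not difficult to obtain general formulae for $J_0^{(t)}$ and $J_1^{(t)}$ from the above
equations, but they are too complicated to be useful (see Latter, 1973a, for
a slightly different model). The equilibrium values of $J_0$ and $J_1$ are, however,
obtained easily by putting $J_0^{(t+1)} = J_0^{(t)} = J_0^{(\infty)}$ and $J_1^{(t+1)} = J_1^{(t)} = J_1^{(\infty)}$.
They become

$$J_0^{(\infty)} = (1 - v)^2[a - (1 - m)^2(1 - v)^2]/(2NG), \tag{5.104a}$$

$$J_1^{(\infty)} = b(1 - v)^2/(2NG) \tag{5.104b}$$

(Maruyama, 1970b, with a small correction), where

$$G = 1 - (1 - v)^2\left[1 - b + a\left(1 - \frac{1}{2N}\right)\right] + (1 - v)^4(1 - m)^2\left(1 - \frac{1}{2N}\right).$$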

Nei  (1972)  has  defined  the  normalized  identity  of  genes  between  two 
populations  as 

$$I = J_{XY}/\sqrt{J_X J_Y}, \tag{5.105}$$

where $J_X$ and $J_Y$ are the values of $J_0$ in populations X and Y, respectively,
and $J_{XY}$ is the value of $J_1$ between X and Y. In the present case $J_{XY} = J_1$
for any pair of subpopulations and $J_X = J_Y = J_0$. Therefore, we have

$$I = \frac{m(2 - m)}{s(1 - m)^2\{1 - (1 - v)^2\} + m(2 - m)} \approx \frac{m(2 - m)}{2sv(1 - m)^2 + m(2 - m)}. \tag{5.106}$$

Thus, as long as vs is small compared with m, I is close to 1 and the gene
differentiation between populations is small. For the gene differentiation
to be substantially large, migration rate must be very small.
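The island-model recurrences can be iterated directly and compared with the equilibrium values. The sketch below assumes the order sampling, migration, mutation, with $a = (1 - m)^2 + m(2 - m)/s$ and $b = m(2 - m)/s$ as the probabilities that two genes from the same or from different subpopulations trace back to one subpopulation. The closed forms used for the equilibrium are obtained by solving those recurrences; function names and parameter values are illustrative.

```python
def island_step(j0, j1, n, s, m, v):
    """One generation of the recurrences for J0 (within) and J1 (between)."""
    a = (1 - m) ** 2 + m * (2 - m) / s
    b = m * (2 - m) / s
    sampled = 1 / (2 * n) + (1 - 1 / (2 * n)) * j0   # identity after sampling
    c = (1 - v) ** 2                                  # neither gene mutates
    return c * (a * sampled + (1 - a) * j1), c * (b * sampled + (1 - b) * j1)


def island_equilibrium(n, s, m, v):
    """Equilibrium (J0, J1) from solving the two recurrences."""
    a = (1 - m) ** 2 + m * (2 - m) / s
    b = m * (2 - m) / s
    c = (1 - v) ** 2
    lam = 1 - 1 / (2 * n)
    g = 1 - c * (1 - b + a * lam) + c * c * (1 - m) ** 2 * lam
    j0 = c * (a - (1 - m) ** 2 * c) / (2 * n * g)
    j1 = b * c / (2 * n * g)
    return j0, j1


def normalized_identity(s, m, v):
    """I = J1/J0 at equilibrium."""
    return m * (2 - m) / (s * (1 - m) ** 2 * (1 - (1 - v) ** 2) + m * (2 - m))


j0, j1 = 1.0, 1.0
for _ in range(5000):
    j0, j1 = island_step(j0, j1, n=50, s=5, m=0.05, v=0.01)
print(j0, j1, j1 / j0)
print(island_equilibrium(50, 5, 0.05, 0.01), normalized_identity(5, 0.05, 0.01))
```

A deliberately large mutation rate is used here so that the iteration converges within a few thousand generations; with realistic rates the approach to equilibrium takes on the order of 1/v generations.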

In  the  above  island  model  the  geographic  distance  between  populations 
is  disregarded.  Maruyama  (1970b,  c,  d,  1973)  studied  the  relationship 
between  JXY  and  distance,  assuming  that  v is  finite.  The  results  obtained 
indicate  that  in  the  case  of  one-dimensional  distribution  JXY  declines  roughly 
exponentially  as  distance  increases,  but  the  rate  of  decline  depends  on  the 
total  length  of  distribution  and  migration  distance.  In  the  case  of  two- 
dimensional  distribution  JXY  rapidly  declines  as  distance  increases  and  the 
relationship  between  JXY  and  distance  is  quite  different  from  the  results  of 
Malecot  (1950,  1967,  1969)  and  Kimura  and  Weiss  (1964)  who  assumed  an 
infinitely  large  number  of  subpopulations.  Furthermore,  the  value  of  I can 
be  close  to  1 even  if  the  distance  is  a thousand  times  larger  than  the  migration 
distance  (Maruyama  and  Kimura,  1974). 

Another  measure  of  population  differentiation  is 

$$G_{ST} = D_{ST}/H_T, \tag{5.107}$$

where $H_T$ is the gene diversity in the total population and $D_{ST}$ the
interpopulational gene diversity, as will be defined in chapter 6. $G_{ST}$ is an extension
of $F_{ST}$ for the case of multiple alleles. In the present case $D_{ST} = (1 - 1/s)(J_0 - J_1)$
and $H_T = 1 - J_0 + D_{ST} = 1 - J_1 - (J_0 - J_1)/s$. Therefore,

$$G_{ST} = \frac{(s - 1)(1 - m)^2(1 - v)^2[1 - (1 - v)^2]}{2NsG - m(2 - m)(1 - v)^2 - (1 - m)^2(1 - v)^2[1 - (1 - v)^2]} \tag{5.108}$$

(Nei, 1974). It is clear that, unlike $F_{ST}$, $G_{ST}$ depends on all the parameters
involved. In the case of $s = \infty$ and $m \ll 1$, we have $G_{ST} = 1/(4Nm + 1)$,
which is equal to $F_{ST}$. However, the applicability of this formula is questionable,
since in the case of $s = \infty$, $H_T = 1$, which would never occur in nature.

Crow and Maruyama (1972) studied the relationship between $J_T =
1 - H_T$ and $J_0$ and showed that at equilibrium

$$J_T = \frac{(1 - v)^2(1 - J_0)}{2N_T(2v - v^2)} \approx \frac{1 - J_0}{4N_T v} \tag{5.109}$$

for any type of migration, where $N_T$ is the total population size. In the
present case this is easily proved by substituting (5.104) into $J_T = J_0^{(\infty)}/s +
(s - 1)J_1^{(\infty)}/s$.

It  should  be  noted  that  formulae  (5.106),  (5.108),  and  (5.109)  depend  on 
the  assumption  that  the  population  is  in  equilibrium  with  respect  to  the  effects 
of  mutation,  migration,  and  genetic  drift.  Strictly  speaking,  in  order  for  this 
equilibrium  to  be  reached  the  breeding  structure  of  the  population  should 
remain  constant  for  a large  number  of  generations  - of  the  order  of  magnitude 
of  the  reciprocal  of  mutation  rate  (Nei  and  Feldman,  1972). 
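The equilibrium relationships among $J_0$, $J_1$, $G_{ST}$, and $J_T$ can be verified numerically for the finite island model. The sketch below assumes the equilibrium identities obtained by solving the island-model recurrences, the definitions $D_{ST} = (1 - 1/s)(J_0 - J_1)$ and $H_T = 1 - J_1 - (J_0 - J_1)/s$, and the exact form $J_T = (1 - v)^2(1 - J_0)/\{2N_T[1 - (1 - v)^2]\}$ of the Crow and Maruyama relation with $N_T = Ns$. Function names and parameter values are illustrative.

```python
def island_equilibrium(n, s, m, v):
    """Equilibrium (J0, J1) for s subpopulations of size n."""
    a = (1 - m) ** 2 + m * (2 - m) / s
    b = m * (2 - m) / s
    c = (1 - v) ** 2
    lam = 1 - 1 / (2 * n)
    g = 1 - c * (1 - b + a * lam) + c * c * (1 - m) ** 2 * lam
    return c * (a - (1 - m) ** 2 * c) / (2 * n * g), b * c / (2 * n * g)


def g_st(j0, j1, s):
    """G_ST = D_ST / H_T for s equally sized subpopulations."""
    d_st = (1 - 1 / s) * (j0 - j1)
    h_t = 1 - j1 - (j0 - j1) / s
    return d_st / h_t


def total_identity(j0, j1, s):
    """J_T = J0/s + (s - 1) * J1 / s."""
    return j0 / s + (s - 1) * j1 / s


n, s, m, v = 100, 10, 0.01, 1e-5
j0, j1 = island_equilibrium(n, s, m, v)
c = (1 - v) ** 2
crow_maruyama = c * (1 - j0) / (2 * n * s * (1 - c))
print(g_st(j0, j1, s), total_identity(j0, j1, s), crow_maruyama)
```

The last two printed values should coincide, which is the content of the Crow and Maruyama relation for this model.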

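5.5.2 Gene differentiation under complete isolation

As far as neutral genes are concerned, we have seen that a substantial
differentiation of genes among populations occurs only when there is little
or no migration. Let us now consider how the gene differentiation proceeds
under complete isolation.

With no migration (5.103a) and (5.103b) reduce to

$$J_0^{(t+1)} = (1 - v)^2\left[\frac{1}{2N} + \left(1 - \frac{1}{2N}\right)J_0^{(t)}\right],$$

$$J_1^{(t+1)} = (1 - v)^2 J_1^{(t)}.$$

Therefore,

$$J_0^{(t)} = J_0^{(\infty)} + \left[J_0^{(0)} - J_0^{(\infty)}\right]\left[(1 - v)^2\left(1 - \frac{1}{2N}\right)\right]^t \approx J_0^{(\infty)} + \left[J_0^{(0)} - J_0^{(\infty)}\right]e^{-(2v + 1/2N)t}, \tag{5.110a}$$

$$J_1^{(t)} = (1 - v)^{2t}J_1^{(0)} \approx J_1^{(0)}e^{-2vt}, \tag{5.110b}$$

where

$$J_0^{(\infty)} = (1 - v)^2/[2N - (2N - 1)(1 - v)^2] \approx 1/(4Nv + 1). \tag{5.111}$$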

A formula  equivalent  to  (5.110a)  was  first  derived  by  Malecot  (1948). 
Formula  (5.111)  is  the  same  as  (5.86)  as  expected. 

The differentiation of subpopulations can again be measured by (5.107),
in which $D_{ST} = (1 - 1/s)(J_0^{(t)} - J_1^{(t)})$ and $H_T = 1 - J_1^{(t)} - (J_0^{(t)} - J_1^{(t)})/s$.
If there is no mutation and $J_0^{(0)} = J_1^{(0)}$, then

$$G_{ST} = \frac{(1 - 1/s)(1 - e^{-t/2N})}{1 - (1 - e^{-t/2N})/s}. \tag{5.112}$$

Therefore, if $s = \infty$, this agrees with the formula for $F_{ST}$ (5.9), as expected.
Clearly, $G_{ST}$ is a more general formula than $F_{ST}$.

When  a population  splits  into  s isolated  populations  but  the  size  of  each 
descendant  population  remains  the  same  as  that  of  the  ancestral  population, 
then we would expect that $J_0^{(0)} = J_1^{(0)} = J_0^{(\infty)}$. In this case we have

$$G_{ST} = \frac{(1 - 1/s)J_0^{(\infty)}(1 - e^{-2vt})}{1 - J_0^{(\infty)}\{e^{-2vt} + (1 - e^{-2vt})/s\}}. \tag{5.113}$$

Thus, the population differentiation now depends on mutation rate. It is
also noted that $G_{ST}$, an extension of $F_{ST}$, is entirely different from $J_0$,
which remains constant in this case. Namely, Wright's fixation index and
homozygosity are different concepts, though they become identical under
certain circumstances.
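The behaviour under complete isolation can be illustrated by iterating the no-migration recurrences from the equilibrium homozygosity. The sketch below assumes $J_0^{(t+1)} = (1 - v)^2[1/(2N) + (1 - 1/(2N))J_0^{(t)}]$ and $J_1^{(t+1)} = (1 - v)^2 J_1^{(t)}$, starting from $J_0 = J_1 = (1 - v)^2/[2N - (2N - 1)(1 - v)^2]$; function names and parameter values are illustrative.

```python
import math


def isolate(j0, j1, n, v, generations):
    """Iterate the within (J0) and between (J1) identities without migration."""
    c = (1 - v) ** 2
    for _ in range(generations):
        j0 = c * (1 / (2 * n) + (1 - 1 / (2 * n)) * j0)
        j1 = c * j1
    return j0, j1


def g_st(j0, j1, s):
    d_st = (1 - 1 / s) * (j0 - j1)
    h_t = 1 - j1 - (j0 - j1) / s
    return d_st / h_t


n, v, s, t = 100, 1e-4, 4, 1000
c = (1 - v) ** 2
j_eq = c / (2 * n - (2 * n - 1) * c)          # equilibrium homozygosity
j0, j1 = isolate(j_eq, j_eq, n, v, t)

decay = math.exp(-2 * v * t)
g_closed = (1 - 1 / s) * j_eq * (1 - decay) / (1 - j_eq * (decay + (1 - decay) / s))
print(j0, j_eq)                 # J0 stays at its equilibrium value
print(j1 / j0, decay)           # identity between populations decays as exp(-2vt)
print(g_st(j0, j1, s), g_closed)
```

The run shows the contrast drawn above: the homozygosity within populations does not change, while $G_{ST}$ increases with time at a rate set by the mutation rate.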

In the presence of mutation $J_1^{(t)} = J_1^{(0)}e^{-2vt}$, while $J_0^{(t)} = J_0^{(0)}$ if the
homozygosity is in equilibrium. Therefore, if $J_1^{(0)} = J_0^{(0)}$,

$$I = J_1^{(t)}/J_0^{(t)} = e^{-2vt} \tag{5.114}$$

(Nei and Feldman, 1972). That is, I declines exponentially as t increases.
We shall discuss this problem in more detail later.

6.1 Introductory remarks

Natural  populations  contain  a large  amount  of  variability  both  in  qualitative 
and quantitative characters. Some part of this variability is evidently environmental,
but a large part is genetic. Quantitative characters such as
stature  and  IQ  are  generally  affected  by  both  genetic  and  environmental 
factors.  The  proportion  of  genetic  variation  in  these  characters  is  usually 
measured  by  a quantity  called  heritability,  which  is  defined  as  the  proportion 
of  genetic  variance  among  the  total  phenotypic  variance.  This  heritability 
amounts  to  10  ~ 50  percent  in  many  quantitative  characters  (Falconer, 
1960).  On  the  other  hand,  the  variation  in  qualitative  characters  such  as  blood 
groups  and  color  blindness  is  almost  exclusively  determined  by  genetic 
factors.  These  genetic  variations  are,  of  course,  caused  by  the  genic  variation 
at  the  DNA  level,  and  naturally  we  are  interested  in  the  question:  how 
variable  are  genes  in  a population? 

Historically,  the  extent  of  genetic  variability  in  natural  populations  was 
first  studied  with  quantitative  characters.  It  soon  became  apparent  that  a 
large  fraction  of  the  variability  of  these  characters  is  genetic  (Fisher,  1918) 
and,  furthermore,  there  is  a large  amount  of  hidden  genetic  variation  which 
can  be  detected  only  by  artificial  selection  (Mather,  1949).  But  these  studies 
could  not  give  much  insight  into  the  variation  at  the  gene  level,  since  the 
relationship  between  the  phenotypes  of  these  characters  and  genes  is  so 
complicated. The genic variation was then studied by examining the frequency
of deleterious genes in natural populations (Sturtevant, 1937;
Dobzhansky  and  Wright,  1941;  and  others).  Deleterious  genes  are  mostly 
recessive,  so  that  they  are  identified  by  means  of  inbreeding.  These  studies 
revealed  that  natural  populations  contain  a large  amount  of  deleterious  genes 
in concealed form (see Dobzhansky, 1970). This approach was, however,
still far from revealing the total amount of genic variation, since this method
detects only those genes which produce a drastic phenotypic effect or a
substantial reduction in viability or fertility.

A more  complete  answer  to  this  question  came  through  the  development 
of  molecular  biology.  On  the  theoretical  side,  Kimura  and  Crow  (1964) 
showed  that  the  number  of  alleles  at  a locus  that  can  be  maintained  in  a 
finite  population  is  fairly  large,  taking  into  account  the  fact  that  at  the 
molecular  level  almost  an  infinite  number  of  alleles  may  be  produced  at  a 
locus.  On  the  other  hand,  the  development  of  starch  gel  electrophoresis 
(Smithies,  1955)  in  combination  with  a simple  staining  technique  for  a 
specific  enzyme  activity  (Hunter  and  Markert,  1957)  provided  a valuable 
tool  by  which  genetic  heterogeneity  of  proteins  and  isozymes  can  easily  be 
detected.  By  1965,  it  was  already  known  that  natural  populations  contain 
a large  amount  of  polymorphism  with  respect  to  proteins  and  enzymes.  In  a 
review article, Shaw (1965) stated that 'enzymes which vary (within populations)
are the rule rather than the exception'. An important step in the study
of  genic  variation  in  populations  was  made  by  Lewontin  and  Hubby  (1966) 
and  Harris  (1966).  These  authors  studied  the  polymorphism  of  a large 
number  of  protein  loci  that  are  presumably  a random  sample  of  the  genome, 
and  showed  that  about  30  percent  of  the  gene  loci  are  polymorphic  with 
respect  to  electrophoretically  detectable  proteins.  Since  then,  a large  number 
of  studies  on  protein  polymorphisms  have  been  done  in  many  different 
species,  and  it  is  now  clear  that  most  natural  populations  contain  a large 
amount  of  genic  variability.  Before  the  advent  of  molecular  biology,  it  was 
known  that  a certain  class  of  genes  such  as  those  for  blood  groups  in  man 
are  quite  polymorphic.  However,  nobody  was  sure  about  how  representative 
they  were  in  the  total  genome. 

In  the  present  chapter  I shall  discuss  the  extent  of  genic  variation  at  the 
molecular  level  and  the  mechanism  of  maintenance  of  the  variation. 

6.2  Measures  of  genic  variation 

The  genic  variation  of  a population  is  usually  measured  by  the  proportion 
of  polymorphic  loci  and  the  average  heterozygosity  per  locus.  A locus  is 
defined  as  polymorphic  if  the  frequency  of  the  commonest  allele  is  equal 
to  or  less  than  0.99.  This  definition  is  clearly  arbitrary  and  there  is  no  reason 
why  the  distinction  between  polymorphic  and  monomorphic  loci  should 
not be made at 0.95 or 0.995 or at some other value. On the other hand, the
homozygosity and heterozygosity at a locus are defined as $j = \sum_i x_i^2$ and
$h = 1 - \sum_i x_i^2$, respectively, where $x_i$ is the frequency of the i-th allele.
Average homozygosity (J) and heterozygosity (H) are the means of these
quantities  over  all  loci  examined.  Thus,  average  heterozygosity  can  be 
defined  unambiguously  and  also  it  has  a number  of  good  properties  from 
the  theoretical  point  of  view,  as  discussed  in  ch.  5.  For  these  reasons, 
average  heterozygosity  is  a better  measure  of  genic  variation  than  the 
proportion  of  polymorphic  loci.  Nevertheless,  we  shall  use  the  latter  measure 
in some limited cases, since it gives a rough idea of the extent of polymorphism.

The  concept  of  homozygosity  and  heterozygosity  was  developed  with  respect 
to  random  mating  populations.  In  nonrandom  mating  populations  the 
heterozygosity  defined  above  is  not  related  to  the  frequency  of  heterozygotes 
in  the  population.  Nevertheless,  it  is  a good  measure  of  genic  variation  in  a 
population;  it  can  be  used  for  any  organism,  whether  it  is  a self-fertilizer  or 
outbreeder or whether it is haploid or polyploid. In these organisms, however,
the word heterozygosity is not appropriate. Therefore, I have called H
gene diversity as a general term (Nei, 1973c). I have also called J gene identity.
These  words  are  particularly  useful  for  describing  the  genic  variability  of 
a subdivided  population.  In  the  following  we  use  both  heterozygosity  and 
gene  diversity,  depending  on  the  situation. 

The  genic  variation  of  a population  can  also  be  measured  by  the  average 
number  of  codon  differences  between  randomly  chosen  genes.  Since  there 
must  be  at  least  one  codon  difference  between  any  pair  of  different  alleles, 
the  minimum  number  of  codon  differences  per  locus  between  two  randomly 
chosen  genomes  can  be  estimated  by 

$$D_{x(m)} = 1 - J, \tag{6.1}$$

where J is the probability of gene identity (homozygosity) per locus. Thus,
$D_{x(m)}$ is equal to average heterozygosity or gene diversity.

A more appropriate estimate of codon differences per locus may be obtained
by

$$D_x = -\log_e J. \tag{6.2}$$

The  rationale  of  this  formula  is  as  follows:  Consider  a cistron  composed 
of n codons, and let $\delta_i$ be the probability that the i-th codon is different
between two randomly chosen cistrons (genes). If $\delta_i$ is independent of $\delta_j$
for any pair of i and j ($i \neq j$), the probability that two randomly chosen
cistrons have an identical codon sequence is

$$P = \prod_{i=1}^{n} (1 - \delta_i) \approx e^{-D_c},$$

where P is the expected gene identity per locus and $D_c = \sum_i \delta_i$ is the expected
number of codon differences per locus (Kimura, 1969a). Thus, equating P
to J, $D_c$ may be estimated by $D_x$. In practice, the codons in a cistron are
closely linked and recombination rarely occurs among them except in
microorganisms. Therefore, (6.2) is expected to give an underestimate of the
number of codon differences. In the foregoing chapter we have seen that in
the absence of selection the expectation of $J = 1 - H = 1/(4Nv + 1)$, while
the expected number of heterozygous codons per locus is $H(1/2N) = 4Nv$.
Thus, if 4Nv is small, then $D_x = -\log_e J \approx 4Nv$, as expected.

In equating P to J, we have implicitly assumed that $D_c$ is the same for
all loci. If this assumption does not hold, $D_x$ may still be an underestimate
of the average number of codon differences per locus, $D_c$. A correction for
this factor can be made by using the geometric mean ($J'$) rather than the
arithmetic mean (J) of gene identities for different loci (Nei, 1973a). That
is, $D_c$ can be estimated by

$$D_x' = -\log_e J'. \tag{6.3}$$
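The three estimates of codon differences can be compared on a small set of loci. The sketch below assumes that the minimum estimate is one minus the arithmetic mean of gene identities, that the second estimate is the negative natural logarithm of the arithmetic mean, and that the corrected estimate uses the geometric mean instead. The function names, dictionary keys, and allele frequencies are hypothetical.

```python
import math


def gene_identity(freqs):
    """Gene identity (homozygosity) at one locus: sum of squared frequencies."""
    return sum(x * x for x in freqs)


def codon_difference_estimates(identities):
    """Return the three estimates computed from per-locus gene identities."""
    r = len(identities)
    j_arith = sum(identities) / r
    j_geom = math.exp(sum(math.log(j) for j in identities) / r)
    return {
        "minimum": 1.0 - j_arith,          # 1 - J
        "log_arith": -math.log(j_arith),   # -log_e J
        "log_geom": -math.log(j_geom),     # -log_e J'
    }


# Hypothetical allele frequencies at three loci.
loci = [[1.0], [0.5, 0.5], [0.9, 0.1]]
identities = [gene_identity(f) for f in loci]
print(codon_difference_estimates(identities))
```

Because the geometric mean never exceeds the arithmetic mean, the corrected estimate is at least as large as the uncorrected one, and both are at least as large as the minimum estimate.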

The  concept  of  ‘codon  differences'  is  useful  in  measuring  the  gene  differences 
between  two  populations  or  in  partitioning  the  gene  diversity  in  subdivided 
populations  into  its  components,  as  will  be  seen  later.  In  practice,  of  course, 
all  the  above  estimates  refer  to  those  codon  differences  that  are  detectable 
by  the  technique  used.  For  example,  electrophoresis  detects  only  about 
25  percent  of  the  actual  codon  (amino  acid)  differences.  Furthermore,  in  this 
method  each  mutational  change  of  a gene  is  counted  as  one  codon  difference 
even  if  it  involves  many  codon  changes  as  in  the  case  of  the  haptoglobin  a2 
allele.  For  lack  of  a better  alternative,  however,  we  shall  use  the  term  ‘codon 
differences'. 

There  are  some  other  measures  of  genic  variation  of  a population.  Some 
authors  have  used  the  average  number  of  alleles  per  locus.  Although  this 
parameter seems to be important in the study of bottleneck effect (Nei et al.,
1975), it has a large sampling variance and when sample size is small it can
be a gross underestimate of the actual number in the population. On the
other hand, if sample size is large, it may include many deleterious genes
most of which are of low frequency and barely contribute to the genic
variation of a population. A slightly different measure suggested by Kimura
and  Crow  (1964)  is  the  effective  number  of  alleles  per  locus.  This  measure 
is,  however,  simply  the  reciprocal  of  homozygosity,  and  its  statistical 
properties  are  not  as  good  as  those  of  heterozygosity. 

Lewontin  (1972)  and  Selander  and  Johnson  (personal  communication, 
1972)  have  used  the  Shannon  information  index  to  measure  genic  variation. 
This  index  is,  however,  designed  to  measure  the  amount  of  information  in 
information  engineering  and  is  not  related  to  any  genetic  entity;  it  is  not  clear 
what  the  absolute  value  of  this  quantity  means  in  terms  of  genetic  materials. 
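
The alternative measures mentioned above are easily compared for a single locus. The following is a minimal sketch, not taken from any of the cited papers; the function name and the example frequencies are hypothetical, and the Shannon index is computed with natural logarithms as an assumption.

```python
import math

def allele_measures(freqs):
    """Compute several measures of genic variation from allele frequencies."""
    present = [x for x in freqs if x > 0]
    homozygosity = sum(x * x for x in present)
    return {
        "n_alleles": len(present),                 # number of alleles observed
        "n_effective": 1.0 / homozygosity,         # reciprocal of homozygosity
        "heterozygosity": 1.0 - homozygosity,      # gene diversity
        "shannon": -sum(x * math.log(x) for x in present),
    }

# Two hypothetical loci: one with two equally frequent alleles,
# one with a common allele and two rare ones.
for freqs in ([0.5, 0.5], [0.9, 0.05, 0.05]):
    print(freqs, allele_measures(freqs))
```

In the second example the number of alleles is 3 but the effective number is only about 1.23, which illustrates how little low-frequency alleles contribute to the genic variation.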

At  any  rate,  average  heterozygosity  or  gene  diversity  seems  to  be  the  best 
parameter to measure genic variation. The sampling property of this
parameter has also been worked out. The theoretical variance of the estimate of
heterozygosity at a locus ($h = 1 - \sum_i x_i^2$) is given by

$$V(h) = \frac{2(n-1)}{n^3}\left\{(3-2n)j^2 + 2(n-2)\sum_i x_i^3 + j\right\}, \qquad (6.4)$$

where $j = 1 - h = \sum_i x_i^2$, $x_i$ is the frequency of the i-th allele, and n is
the number of genes sampled (Nei and Roychoudhury, 1974a).
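
The sampling variance (6.4) can be evaluated directly. The sketch below assumes that (6.4) is the multinomial sampling variance of the sum of squared allele frequencies for n genes, that is, 2(n-1)/n^3 times {(3-2n)j^2 + 2(n-2)Σx_i^3 + j}; the function name and the example values are hypothetical.

```python
def heterozygosity_variance(freqs, n):
    """Sampling variance of estimated heterozygosity at one locus.

    freqs: population allele frequencies x_i (summing to 1).
    n: number of genes sampled (twice the number of diploid individuals).
    """
    j = sum(x ** 2 for x in freqs)     # gene identity, j = 1 - h
    s3 = sum(x ** 3 for x in freqs)    # sum of cubed frequencies
    return 2 * (n - 1) / n ** 3 * ((3 - 2 * n) * j ** 2 + 2 * (n - 2) * s3 + j)

# Two equally frequent alleles, 20 individuals (40 genes) sampled.
v = heterozygosity_variance([0.5, 0.5], 40)
print(f"V(h) = {v:.6f}, standard error = {v ** 0.5:.4f}")
```

For a monomorphic locus the expression in braces vanishes, so the variance is zero, as it should be.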

Heterozygosity,  however,  generally  varies  considerably  with  locus,  and 
thus  the  variance  of  average  heterozygosity  of  a population  includes  the 
interlocus  variance.  If  gene  frequencies  for  r loci  are  studied,  the  average 
heterozygosity  (H)  and  its  sampling  variance  can  be  estimated  by 

$$H = \sum_{l=1}^{r} h_l / r, \qquad (6.5)$$

and

$$V(H) = \sum_{l=1}^{r} (h_l - H)^2 / \{r(r-1)\}, \qquad (6.6)$$

respectively, where subscript l refers to the l-th locus. Some authors have
estimated  average  heterozygosity  by  computing  the  actual  proportion  of 
heterozygotes  in  the  population.  This  quantity,  however,  has  a rather  poor 
statistical  property  particularly  in  small  populations  (Nei  and  Roychoudhury, 
1974a). 
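
Formulae (6.5) and (6.6) amount to taking the mean of the single-locus heterozygosities and the squared standard error of that mean. A minimal sketch follows; the function name and the four heterozygosity values are hypothetical.

```python
import math

def average_heterozygosity(h_values):
    """Return H, its sampling variance V(H), and its standard error.

    h_values: heterozygosities h_l for r loci (r >= 2).
    """
    r = len(h_values)
    H = sum(h_values) / r                                      # (6.5)
    var_H = sum((h - H) ** 2 for h in h_values) / (r * (r - 1))  # (6.6)
    return H, var_H, math.sqrt(var_H)

# Hypothetical data: two monomorphic loci and two polymorphic loci.
H, var_H, se = average_heterozygosity([0.0, 0.0, 0.5, 0.1])
print(f"H = {H:.3f} ± {se:.3f}")
```

Since the variance is dominated by the differences among loci, adding loci reduces the standard error much more effectively than adding individuals per locus.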

For  estimating  average  heterozygosity  or  gene  diversity,  a large  number  of 
loci,  which  are  ideally  a random  sample  of  the  genome,  should  be  examined. 
The  number  of  individuals  to  be  studied  per  locus  can  be  rather  small  (about 
20  individuals).  Formulae  (6.5)  and  (6.6)  can  be  used  in  any  organism 
irrespective  of  its  reproductive  system.  On  the  other  hand,  (6.4)  depends  on 
the assumption of the Hardy-Weinberg equilibrium, and if this is not
fulfilled,  some  modification  is  necessary.  The  sampling  variances  of  j 
and  D'x  have  also  been  obtained  by  Nei  and  Roychoudhury  (1974a). 


6.3  Gene  diversity  within  populations 

6.3.1  Enzyme  and  protein  loci 
1)  Outbreeding  organisms 

One  of  the  organisms  in  which  the  most  extensive  data  on  gene  frequencies 
are  available  is  man.  Surveying  the  literature,  Nei  and  Roychoudhury  (1972, 
1974b)  studied  the  average  heterozygosities  in  the  three  major  races  of  man, 
Caucasoids, Negroids, and Mongoloids. The number of loci for which
gene frequency data were available was 74 for Caucasoids, 62 for
Negroids,  and  35  for  Mongoloids.  The  average  heterozygosities  obtained 
are  given  in  table  6.1,  together  with  the  proportions  of  polymorphic  loci. 
The  average  heterozygosity  per  locus  for  Caucasoids  is  about  10  percent 
when  all  74  loci  are  used.  In  a similar  study  of  the  European  population, 
Harris  and  Hopkinson  (1972)  showed  that  the  average  heterozygosity  is 
7 percent.  The  difference  between  these  two  sets  of  data  is  probably  due  to 


Table 6.1

Proportion of polymorphic loci and average heterozygosity (gene diversity) for protein
loci in the three major races of man. Modified from Nei and Roychoudhury (1974b).

                No. of      Polymorphic   Average            Codon differences
                loci used   loci          heterozygosity     Dx        Dx'

Caucasoid   a)  74          0.31          0.099 ± 0.021      0.104     0.130
            b)  62          0.32          0.104 ± 0.023      0.110     0.137
            c)  35          0.40          0.142 ± 0.034      0.153     0.187
Negroid     b)  62          0.40          0.092 ± 0.019      0.097     0.115
            c)  35          0.51          0.122 ± 0.028      0.131     0.151
Mongoloid   c)  35          0.40          0.098 ± 0.027      0.103     0.122

a) All loci for Caucasoids; b) Common loci for Caucasoids and Negroids; c) Common
loci for Caucasoids, Negroids, and Mongoloids.




Table 6.2

Average heterozygosities (gene diversities) within random mating populations of various
species. Modified from Selander and Kaufman (1973a).

Organism              Note   Number       Number       Gene diversity
                             of species   of loci      Mean     Range

Invertebrates
Drosophila            a      6            16 ~ 23      0.135    0.08 ~ 0.21
Field cricket         b      1            20           0.145    -
Horseshoe crab        c      1            25           0.097    -
Land snail            d      1            17           0.207    0.14 ~ 0.25
Weevils (2 genera)    e      2            17 ~ 24      0.240    0.17 ~ 0.31
Lobster               f      1            43           0.038    -

Vertebrates
Astyanax (fish)       g      1            17           0.112    -
Lizards (3 genera)    h      4            15 ~ 29      0.058    0.05 ~ 0.07
Rodents (5 genera)    i      11           18 ~ 41      0.055    0.01 ~ 0.09
Newts                 j      3            18           0.084    0.05 ~ 0.11
Sparrow               k      1            15           0.059    -

a Prakash (1969), Prakash et al. (1969), Lakovaara and Saura (1971a, b), Ayala et al. (1972),
Richmond (1972); b Selander and Kaufman (1973a); c Selander et al. (1970); d Selander
and Kaufman (1973a); e Soumalainen and Saura (1973); f Tracey et al. (1975); g Avise and
Selander (1972); h Hall and Selander (1973), McKinney et al. (1972), Tinkle and Selander
(1973), Webster et al. (1972); i Selander and Yang (1969), Selander et al. (1969, 1971),
Johnson and Selander (1971), Johnson et al. (1972), Patton et al. (1972), Smith et al. (1973);
j Hedgecock and Ayala (1974); k Nottebohm and Selander (1972).

the  fact  that  Nei  and  Roychoudhury  included  12  nonenzymic  loci  which  are 
more  polymorphic  than  enzymic  loci  in  man,  whereas  Harris  and  Hopkinson 
studied  only  enzymic  loci.  (In  many  other  vertebrate  species,  however, 
enzymic  and  nonenzymic  protein  loci  appear  to  be  equally  polymorphic;  see 
table  6.3.)  The  heterozygosities  of  the  three  major  races  may  be  compared 
by  using  62  or  35  common  loci.  It  is  clear  that  although  Caucasoids  seem 
to  be  genetically  more  heterogeneous  than  Negroids  and  Mongoloids,  the 
racial  differences  in  heterozygosity  are  not  statistically  significant.  Therefore, 
we  may  conclude  that  the  average  heterozygosity  or  gene  diversity  is  about 
10  percent  in  all  three  major  races. 
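
The statement that the racial differences are not significant can be checked roughly from the standard errors in table 6.1. The sketch below is only an approximation and is not the test used by the original authors: it treats the two estimates as independent, although they are based on the same loci, and the function name is hypothetical.

```python
import math

def z_for_difference(h1, se1, h2, se2):
    """Approximate normal deviate for the difference of two heterozygosities,
    assuming (as a simplification) that the two estimates are independent."""
    return (h1 - h2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Values from table 6.1.
z_b = z_for_difference(0.104, 0.023, 0.092, 0.019)  # Caucasoid vs Negroid, 62 loci
z_c = z_for_difference(0.142, 0.034, 0.098, 0.027)  # Caucasoid vs Mongoloid, 35 loci
print(f"62 common loci: z = {z_b:.2f}")
print(f"35 common loci: z = {z_c:.2f}")
```

Both deviates are well below 1.96, the conventional 5 percent level, in agreement with the conclusion in the text.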

Table 6.1 includes the standard and maximum estimates of codon
differences per locus between two randomly chosen genomes. These estimates are
only  slightly  larger  than  the  average  heterozygosity,  which  is  a minimum 
estimate  of  codon  differences.  This  indicates  that  the  difference  between  two 
alleles  is,  in  a majority  of  cases,  caused  by  a single  codon  difference. 
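
The closeness of the standard estimate to the average heterozygosity can be verified from the figures in table 6.1. The check below assumes that the standard estimate is computed as -log_e(1 - H), where H is the average heterozygosity; the variable names are hypothetical.

```python
import math

# (population and locus set, average heterozygosity H, tabulated Dx) from table 6.1
TABLE_6_1 = [
    ("Caucasoid a", 0.099, 0.104),
    ("Caucasoid b", 0.104, 0.110),
    ("Caucasoid c", 0.142, 0.153),
    ("Negroid b",   0.092, 0.097),
    ("Negroid c",   0.122, 0.131),
    ("Mongoloid c", 0.098, 0.103),
]

for label, H, tabulated in TABLE_6_1:
    computed = -math.log(1.0 - H)   # assumed form of the standard estimate
    print(f"{label}: H = {H:.3f}, -ln(1 - H) = {computed:.3f}, table = {tabulated:.3f}")
```

The computed values agree with the tabulated ones to within rounding, and in every case the excess over H is less than about 0.01.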

Average heterozygosity has been studied in many organisms, though the
number of loci examined is not always large. Table 6.2 gives the estimates
of average heterozygosity for various organisms in which a relatively large
number of loci have been studied. The standard errors of these estimates
are not known but appear to be large. It is seen that the average
heterozygosity varies considerably with organism. It tends to be smaller in
vertebrates than in invertebrates, though there are many exceptions. This is
probably  due  to  the  fact  that  the  population  size  of  vertebrate  species  is 
generally  much  smaller  than  that  of  invertebrate  species.  The  highest  value 
observed  so  far  is  0.309  in  Otiorrhynchus  scaber  (weevil;  Soumalainen  and 
Saura,  1973),  while  the  lowest  value  is  almost  0 in  Dipodomys  panamintinus 
(Johnson  and  Selander,  1971),  though  the  number  of  loci  examined  was  only 
17  in  the  latter.  The  average  heterozygosities  of  the  species  in  the  genus 
Dipodomys  (kangaroo  rats)  are  generally  very  small  (H  = 0.000  ~ 0.051) 
compared  with  those  of  other  outbreeding  organisms.  This  low  level  of  gene 
diversity  probably  reflects  the  relatively  small  effective  population  size  at 
present  or  in  the  past  in  these  animals.  These  nocturnal  and  burrowing 
rodents are distributed in limited areas of the western and southwestern
United States and Mexico. In particular, D. panamintinus and D.
elator,  which  have  the  lowest  level  of  gene  diversity,  are  distributed  in  small 
geographic  areas  (Johnson  and  Selander,  1971).  A low  level  of  average 
heterozygosity  (1.7  %)  was  also  observed  in  the  Japanese  macaque,  of  which 
the  population  (census)  size  has  been  estimated  to  be  20,000  ~ 70,000 
(Nozawa  et  al.,  1974).  The  theoretical  expectation  that  gene  diversity  is 
smaller  in  small  populations  than  in  large  populations  has  been  demonstrated 
in  the  comparison  of  cave  (H  = 0 ~ 7.7  %)  and  surface  (H  = 7.7  ~ 13.8  %) 
populations  of  the  characid  fish  Astyanax  mexicanus  (Avise  and  Selander, 
1972) and of island (H = 0.02) and continental (H = 0.05 ~ 0.08) populations of
Peromyscus polionotus (Selander et al., 1971). Furthermore, Bonnell and
Selander (1974) have recently reported that in the northern elephant seal,
Mirounga angustirostris, which experienced an extremely small bottleneck in
population size (about 20 individuals) owing to heavy hunting in the last
century, no polymorphisms exist at the 24 protein loci studied.

If  we  exclude  the  organisms  with  small  effective  population  size,  however, 
the  average  heterozygosity  of  outbreeding  organisms  is  about  10  percent. 
That is, an individual appears to be heterozygous for 10 percent of the total
genes. These estimates were obtained by studying electrophoretically
detectable protein loci. As discussed in ch. 3, only about 25 ~ 30 percent of codon
differences are detected by electrophoresis. If we make the correction for
this factor, an individual is expected to be heterozygous for about 30 to
40  percent  of  its  total  genes.  The  exact  number  of  structural  genes,  i.e., 
protein-coding  cistrons,  in  higher  organisms  is  not  known.  Muller’s  (1967) 
guess  for  this  number  in  man  is  30,000.  We  have  noted  that  the  average 
heterozygosity or gene diversity is equal to the average probability of
nonidentity of two randomly chosen genes. Therefore, if all loci are in linkage
equilibrium, the probability that two genomes, one from each of two
randomly chosen individuals, have the same array of genes for the 30,000
loci is $(1 - H)^{3 \times 10^4}$, which is equal to $10^{-1372}$ for H = 0.1 and $10^{-6655}$
for H = 0.4. For the two individuals to be genetically identical, the other
genomes must also be identical. If we note that the present world population
of man is $3.6 \times 10^9$, this clearly indicates that any two individuals in this
world must be genetically different except identical twins. This is true for all
organisms  in  nature,  which  reproduce  by  outbreeding.  It  is  safe  to  state 
that  in  the  whole  history  of  mammalian  evolution  no  two  individuals  have 
ever  been  genetically  identical  except  identical  twins  and  artificially  inbred 
laboratory  animals. 
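
The orders of magnitude quoted above are easily reproduced. The sketch below works with logarithms, since the probabilities themselves are far too small for ordinary floating-point numbers; the function names are hypothetical, and the 30,000 loci are Muller's guess as cited in the text.

```python
import math

def corrected_heterozygosity(observed, detectability):
    """Correct an electrophoretic estimate for the fraction of differences detected."""
    return observed / detectability

def log10_prob_identical_genomes(H, n_loci=30_000):
    """log10 of (1 - H)**n_loci, the chance that two genomes agree at every locus,
    assuming linkage equilibrium and the same H at all loci."""
    return n_loci * math.log10(1.0 - H)

print(corrected_heterozygosity(0.10, 0.30), corrected_heterozygosity(0.10, 0.25))
for H in (0.1, 0.4):
    print(f"H = {H}: probability is about 10^{log10_prob_identical_genomes(H):.1f}")
```

The exponents come out close to -1372.7 and -6655.5, in line with the values given in the text, and both are utterly negligible beside a world population of the order of 10^9.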

From  table  6.1  we  estimate  that  the  number  of  heterozygous  codons 
(codon  differences)  in  man  is  about  0.3  ~ 0.6  per  locus  after  correction  for 
electrophoretic  detectability.  An  'average  cistron'  in  man  seems  to  have 
about  400  codons  (ch.  3).  Therefore,  roughly  speaking,  about  0.1  percent 


Fig.  6.1.  Frequency  distributions  of  heterozygosity  for  protein  and  blood  group  loci  in 
man  (Caucasoids).  From  Nei  and  Roychoudhury  (1974b). 
of the codons are expected to be heterozygous. We have also seen that the
probability of a nucleotide substitution resulting in an amino acid
substitution is about 3/4. If we make a further correction for this effect, noting that
each codon is composed of three nucleotide pairs, the proportion of
heterozygous nucleotide sites is estimated to be about $4 \times 10^{-4}$. The human
haploid genome has about $3.2 \times 10^9$ nucleotide pairs. Therefore, an average
man is heterozygous for some 1,200,000 nucleotide sites (see also Kimura,
1973). This indicates how vast the genetic variability in man is at the
nucleotide level. It is clear from table 6.2 that a similar conclusion can be made
with  most  outbreeding  higher  organisms. 
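
The chain of rough corrections leading to the figure of about a million heterozygous nucleotide sites can be retraced step by step. All input values in the sketch below are those quoted in the text; the constant and function names are hypothetical.

```python
CODONS_PER_CISTRON = 400          # 'average cistron' in man (ch. 3)
P_AMINO_ACID_CHANGE = 3 / 4       # chance that a nucleotide change alters the amino acid
NUCLEOTIDES_PER_CODON = 3
GENOME_NUCLEOTIDE_PAIRS = 3.2e9   # human haploid genome

def heterozygous_nucleotide_fraction(codon_fraction):
    """Convert a proportion of heterozygous codons into a proportion of
    heterozygous nucleotide sites."""
    return codon_fraction / P_AMINO_ACID_CHANGE / NUCLEOTIDES_PER_CODON

# 0.3 ~ 0.6 heterozygous codons per locus, spread over 400 codons.
low, high = 0.3 / CODONS_PER_CISTRON, 0.6 / CODONS_PER_CISTRON
print(f"heterozygous codons: {low:.5f} ~ {high:.5f} (about 0.1 percent)")

site_fraction = heterozygous_nucleotide_fraction(0.001)
print(f"heterozygous nucleotide sites: {site_fraction:.2e}")
print(f"sites per genome, using the rounded 4e-4: {4e-4 * GENOME_NUCLEOTIDE_PAIRS:,.0f}")
```

The unrounded proportion is about 4.4 x 10^-4; with the rounded value of 4 x 10^-4 used in the text the product is 1,280,000, which the text gives as some 1,200,000.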

So  far  we  have  been  concerned  with  average  heterozygosity  or  average 
numbers of heterozygous codons and nucleotide pairs. However,
heterozygosity varies considerably with locus. Fig. 6.1 shows the frequency
distributions of heterozygosity for 74 protein and 57 blood group loci in
Caucasoid populations of man. The distributions are both inverted-J shaped
with a small peak in the tail. At about 65 percent of the loci studied
heterozygosity is smaller than 0.02, but at a few loci it is as large as about 0.5. A
similar distribution has been obtained for Negroid and Mongoloid
populations (Nei and Roychoudhury, 1974b). This type of distribution seems to
hold  also  with  other  organisms,  though  the  proportion  of  polymorphic  loci 
varies  considerably  with  the  organism. 

This  high  degree  of  interlocus  variation  is  theoretically  expected  if  each 
locus  undergoes  gene  substitution  independently  at  a low  rate.  A locus 
becomes  polymorphic  when  gene  substitution  is  taking  place  or  when  a 
mutant  gene  has  become  frequent  by  chance  though  it  is  destined  eventually 
to  disappear  from  the  population.  But  otherwise  it  is  monomorphic.  Natural 
populations  include  a mixture  of  loci  which  are  at  various  stages  of  evolution. 
Therefore,  a high  degree  of  interlocus  variation  in  heterozygosity  would 
result.  The  interlocus  variation  may  also  be  induced  by  the  difference  in 
mutation rate or natural selection among loci. The rate of amino acid
substitution per polypeptide varies considerably with locus (ch. 3). The expected
heterozygosity  is  larger  when  this  rate  (or  mutation  rate)  is  high  than  when 
this  is  low.  At  the  majority  of  the  enzyme  or  protein  loci  so  far  studied,  the 
mutation  rate  or  the  rate  of  gene  substitution  is  not  known,  but  there  must 
be  some  degree  of  interlocus  variation  in  this  quantity.  A similar  effect  may 
be  produced  if  the  type  and  intensity  of  natural  selection  vary  with  locus. 

Selander and Johnson (1973) studied the gene diversities (heterozygosities)
of various proteins in the rodents Thomomys (2 species), Dipodomys (3),
Sigmodon (2), Peromyscus (4), and Mus (3 semispecies); a passerine bird,

Table 6.3

Average gene diversities (heterozygosities) for different proteins. From Selander and
Johnson (1973).

Protein*               No. of            Species           Average
                       species           polymorphic (%)   gene diversity

Group I
Super. NAD-MDH         23                17                0.0066
Mito. NAD-MDH          24                17                0.0119
Super. ME              11                27                0.0553
6PGD                   23                74                0.0840
G6PD                   12                8                 0.0028
αGPD                   21                65                0.0676
Super. IDH             21                57                0.0719
Mito. IDH              18                22                0.0031
LDH-1                  24                50                0.0469
LDH-2                  24                42                0.0127
PGI                    21                57                0.0410
PGM-1                  24                79                0.1072
PGM-2 or PGM-3         15                53                0.1280
Mean                                     43.7              0.0492

Group II
ADH                    16                44                0.0908
SDH                    6                 0                 0.0000
Super. GOT             21                57                0.0475
Mito. GOT              17                18                0.0018
IPO**                  18                22                0.0454
Esterases†             16 (4.25/sp.)     44                0.1341
Mean                                     30.8              0.0533

Group III
ALB                    23                30                0.0610
TRF                    18                67                0.1033
HB (2 loci)            17                21                0.0605
General proteins††     24 (3.17/sp.)     8                 0.0054
Mean                                     29.4              0.0582

Grand mean                               37.50             0.0519

* Group I: Glucose-metabolizing enzymes; Group II: Other enzymes; Group III:
Nonenzymatic proteins.

** Homology across species uncertain for indophenol oxidase.

† 68 esterases, or a mean of 4.25 loci per species; 30 loci polymorphic. Values are means
for all loci.

†† 76 'general proteins', or a mean of 3.17 loci per species; 6 loci polymorphic. Values
are means for all loci.

Zonotrichia (1); lizards Sceloporus (3), Anolis (4), and Uta (1); and a fish,
Astyanax (1). The estimate of average gene diversity for each of the proteins
studied is given in table 6.3. There is a wide range of variation among proteins;
esterases  and  PGM  show  a high  degree  of  gene  diversity,  while  G6PD,  SDH, 
general  proteins,  etc.,  show  a low  gene  diversity.  Clearly,  gene  diversity 
varies  with  locus.  However,  caution  must  be  exercised  in  the  interpretation 
of  these  data,  since  some  of  the  species  studied  are  closely  related.  As  we 
have  seen  in  ch.  5,  polymorphic  genes  may  persist  in  the  population  longer 
than  species  life,  so  that  the  gene  diversity  at  a locus  in  a species  may  be 
correlated  to  that  of  the  other  species,  if  they  are  closely  related. 

Many  proteins  examined  by  electrophoresis  are  of  unknown  physiological 
function  and  have  broad  substrate  specificities  (nonspecificity).  Gillespie  and 
Kojima  (1968)  proposed  the  hypothesis  that  enzymes  known  to  be  active 
in  energy  metabolism  (Group  I)  are  virtually  monomorphic  or  at  least  less 
polymorphic than nonspecific enzymes (Group II). This hypothesis is
supported by the data on gene diversity in some species of Drosophila (Kojima
et  al.,  1970;  Ayala  and  Powell,  1972)  and  in  man  (Cohen  et  al.,  1973),  while 
Nair  et  al.  (1971)  failed  to  confirm  this  in  six  species  of  the  mesophragmatica 
group  of  Drosophila.  This  problem  should  be  examined  by  using  widely 
varying  organisms.  A glance  at  table  6.3  reveals  that  the  Gillespie-Kojima 
hypothesis  does  not  necessarily  hold  in  vertebrates. 

Johnson  (1974)  proposed  a similar  hypothesis,  claiming  that  'regulatory 
enzymes'  are  more  polymorphic  than  'nonregulatory  enzymes'.  The  data  he 
compiled  support  this  hypothesis,  though  there  are  some  problems  in  his 
classification  of  enzymes  and  statistical  analysis.  He  took  this  result  as 
evidence  against  the  neutral  mutation  hypothesis.  This  conclusion,  however, 
is  not  warranted.  If  the  difference  in  polymorphism  between  the  two  groups 
of  enzymes  is  real,  it  may  mean  that  the  degree  of  functional  requirement 
in protein structure is different between the two groups. But the
polymorphism in each enzyme may still be neutral (ch. 8).

One  of  the  important  questions  about  protein  polymorphism  is  whether 
it  is  related  to  the  variation  of  morphological  characters.  This  problem  was 
studied  by  Soule  et  al.  (1973)  in  eight  species  of  Anolis  lizards  and  thirteen 
populations  of  the  side-blotched  lizards  Uta  stansburiana.  They  found  a 
strong  correlation  between  the  level  of  intraspecies  gene  diversity  and  the 
coefficient  of  variation  of  the  number  of  subdigital  scales  on  a toe.  In 
U.  stansburiana,  however,  the  correlation  between  gene  diversity  and  mean 
coefficient  of  variation  for  five  morphological  characters  was  rather  weak. 

2) Asexual reproduction and parthenogenesis

Although most higher animals reproduce bisexually, most of the lower
organisms, many plants, and some invertebrate animals reproduce asexually,
parthenogenetically, or by selfing. Reproductive methods affect the
population dynamics of genes considerably. The population dynamics of genes is
also affected by ploidy of the organism.

Asexual  reproduction  and  parthenogenesis  have  virtually  the  same  effect, 
though there are various kinds of parthenogenesis in plants. Both reproductive
methods  prevent  the  recombination  of  genes  and  the  whole  set  of  genes  in  an 
individual  is  inherited  together  to  the  next  generation.  Thus,  the  unit  of 
inheritance  is  not  the  gene  but  the  genotype,  and  all  genes  are  'completely 
linked'. The unit of sampling at the time of reproduction is also the
genotype rather than the gene. In this respect each genotype behaves just like
a single allele of a multiple-allelic locus in haploid organisms. However,
mutation  occurs  at  each  locus  separately  and  the  gene  is  still  the  unit  of 
function.  Therefore,  protein  polymorphism  is  examined  for  each  locus  or 
for  each  protein  separately.  Average  gene  diversity  (heterozygosity)  per  locus 
still  can  be  computed  in  the  same  way  as  in  the  case  of  random  mating 
population.  Nevertheless,  it  must  be  kept  in  mind  that  all  the  genes  are 
'completely  linked'  and  thus  a strong  linkage  disequilibrium  is  expected  to 
occur  among  different  loci.  Also,  genotype  frequencies  at  a locus  generally 
do not follow Hardy-Weinberg proportions, so that gene diversity has
nothing  to  do  with  the  proportion  of  heterozygotes  in  the  population.  It 
simply  measures  the  amount  of  genetic  variability  of  a population,  as 
originally  intended. 

It has often been assumed that asexual organisms are at a dead end of
evolution and that lack of recombination reduces the genetic variability in these
organisms.  This  assumption  is,  of  course,  not  warranted,  because  the  source 
of  genetic  variability  is  not  recombination  but  mutation.  If  mutation  rate 
and  population  size  remain  the  same,  we  would  expect  that  the  average 
gene  diversity  per  locus  in  an  asexual  population  is  more  or  less  the  same  as 
that  of  a random  mating  population.  Natural  selection  specific  to  asexual 
organisms  may  increase  or  decrease  the  gene  diversity. 

Unfortunately,  only  a few  studies  have  been  made  on  the  gene  diversity 
of  asexual  or  parthenogenetic  organisms.  Nevertheless,  they  provide  an 
insight  into  some  intriguing  features  of  asexual  reproduction.  Levin  and 
Crepet (1973) studied the polymorphisms of 11 proteins encoded by 18 loci
in 16 populations of a phylogenetic relic plant, Lycopodium lucidulum (club moss),
in  Connecticut  and  New  York.  In  13  loci  out  of  18,  all  the  populations  were 

Table 6.4

Gene frequencies at the polymorphic loci and average gene diversity per locus (H) in
Lycopodium lucidulum. The total number of loci examined is 18. From Levin and Crepet
(1973).

Locus:            Woodridge,    Litchfield,   Binghamton,   New Lebanon,
allele            Conn.         Conn.         N.Y.          N.Y.
                  (N = 11)*     (N = 28)      (N = 14)      (N = 28)

PGI-2
  a               0.68          1.00          1.00          0.75
  b               0.32          0.00          0.00          0.25
G6PD-1
  a               0.93          1.00          0.82          0.91
  b               0.07          0.00          0.18          0.09
[name illegible]-1
  a               1.00          1.00          0.50          1.00
  b               0.00          0.00          0.50          0.00
PGM
  a               0.00          0.00          0.50          0.00
  b               0.86          1.00          0.50          1.00
  c               0.14          0.00          0.00          0.00
LGGP-1
  a               0.50          0.50          1.00          1.00
  b               0.50          0.50          0.00          0.00

Average gene
diversity         0.07          0.03          0.07          0.03

* N = Number of individuals examined.


monomorphic  for  the  same  allele.  In  the  remaining  five  loci,  however, 
polymorphism  was  observed  in  some  or  all  populations.  Average  gene 
diversities  in  four  representative  populations  are  given  in  table  6.4,  together 
with  the  gene  frequencies  for  polymorphic  loci.  As  expected,  average  gene 
diversity  varies  considerably  with  population,  but  the  overall  mean  for  the 
four  populations  is  not  much  different  from  the  values  for  some  vertebrates. 
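
The average gene diversities in table 6.4 can be recomputed from the gene frequencies, remembering that the 13 loci not listed are monomorphic and contribute zero. In the sketch below the frequencies are those of table 6.4, with the frequency of allele a at G6PD-1 in Woodridge taken as 1 - 0.07 = 0.93; the variable and function names are hypothetical.

```python
N_LOCI = 18  # total number of loci examined

# Allele frequencies at the five polymorphic loci, in the order
# PGI-2, G6PD-1, third locus, PGM, LGGP-1 (from table 6.4).
FREQUENCIES = {
    "Woodridge":   [[0.68, 0.32], [0.93, 0.07], [1.00, 0.00], [0.00, 0.86, 0.14], [0.50, 0.50]],
    "Litchfield":  [[1.00, 0.00], [1.00, 0.00], [1.00, 0.00], [0.00, 1.00, 0.00], [0.50, 0.50]],
    "Binghamton":  [[1.00, 0.00], [0.82, 0.18], [0.50, 0.50], [0.50, 0.50, 0.00], [1.00, 0.00]],
    "New Lebanon": [[0.75, 0.25], [0.91, 0.09], [1.00, 0.00], [0.00, 1.00, 0.00], [1.00, 0.00]],
}

def average_gene_diversity(loci, n_loci):
    """Mean of h = 1 - sum(x_i^2) over n_loci loci; loci not listed count as h = 0."""
    total = sum(1.0 - sum(x * x for x in freqs) for freqs in loci)
    return total / n_loci

for population, loci in FREQUENCIES.items():
    print(f"{population}: H = {average_gene_diversity(loci, N_LOCI):.2f}")
```

The four values round to 0.07, 0.03, 0.07, and 0.03, in agreement with the last row of the table.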

Examination  of  the  gene  frequencies  in  table  6.4,  however,  reveals  that 
the  gene  frequency  pattern  within  populations  is  quite  different  from  that 
of  random  mating  populations.  First,  the  frequency  of  an  allele  is  often  1, 
0, or 0.5. This is because the individuals in a population are often all
homozygous for a particular allele or all heterozygous for a particular pair of alleles.
That is, even if the gene frequency is 0.5, the population may be homogeneous
at  that  locus.  In  fact,  the  Litchfield  population  is  entirely  homogeneous  with 
respect  to  the  18  loci  studied,  and  consists  of  a single  genotype,  though 
average  gene  diversity  is  not  0.  Namely,  in  this  case,  even  if  gene  diversity 
is  not  0,  'genotype  diversity'  is  0. 
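
The distinction between gene diversity and 'genotype diversity' can be made concrete. In the sketch below genotype diversity is defined, by analogy with gene diversity, as one minus the sum of squared genotype frequencies; this definition and the function name are assumptions made for illustration, since the text uses the term only informally.

```python
from collections import Counter

def gene_and_genotype_diversity(genotypes):
    """genotypes: list of diploid genotypes at one locus, each a pair of alleles.

    Returns (gene diversity, genotype diversity), each computed as
    1 - sum of squared frequencies.
    """
    allele_counts = Counter(allele for g in genotypes for allele in g)
    n_genes = sum(allele_counts.values())
    gene_div = 1.0 - sum((c / n_genes) ** 2 for c in allele_counts.values())

    genotype_counts = Counter(tuple(sorted(g)) for g in genotypes)
    n_ind = len(genotypes)
    genotype_div = 1.0 - sum((c / n_ind) ** 2 for c in genotype_counts.values())
    return gene_div, genotype_div

# A clonal population of 28 plants, all heterozygous a/b at the locus.
print(gene_and_genotype_diversity([("a", "b")] * 28))
```

For such a population the gene frequency is 0.5 and the gene diversity is 0.5, although every individual has the same genotype and the genotype diversity is 0.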

The second feature of the gene frequency pattern in L. lucidulum is that
the  gene  or  genotype  frequency  varies  conspicuously  among  the  four 
populations,  though  these  populations  are  geographically  located  rather 
close to each other. For example, at the LGGP-1 locus genotype a/b is fixed
in the Woodridge and Litchfield populations, while in the Binghamton and
New Lebanon populations genotype a/a is fixed. In organisms which
reproduce by random mating such a difference in gene or genotype frequency
rarely  occurs. 

The  above  two  patterns  of  gene  frequency  distributions  suggest  that  the 
effective  number  of  these  populations  is  very  small.  The  population  biology 
of  this  organism  is  not  well  known,  but  it  is  possible  that  a relatively  small 
number  of  individuals  produce  a large  number  of  descendants  in  each 
locality and other individuals produce virtually no offspring. Since the
unit  of  inheritance  is  the  individual,  the  heterozygote  at  a particular  locus 
may  be  fixed  in  the  population  by  genetic  drift.  Clearly,  the  frequency  of 
heterozygotes  has  little  to  do  with  heterozygote  advantage. 
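
That a heterozygous genotype can be fixed by drift alone is easily shown by simulation. The following sketch assumes a very simple model that is not taken from the text: a clonal population of constant size in which each individual of the next generation copies the whole genotype of a randomly chosen parent, with no selection and no mutation. The function name, population size, and random seed are hypothetical.

```python
import random

def clonal_drift(genotypes, rng, max_generations=100_000):
    """Resample whole genotypes each generation until one is fixed.

    Returns (fixed genotype, number of generations), or (None, max_generations)
    if fixation has not occurred within the limit.
    """
    population = list(genotypes)
    for generation in range(max_generations):
        if len(set(population)) == 1:
            return population[0], generation
        population = [rng.choice(population) for _ in population]
    return None, max_generations

# Ten plants: five homozygotes a/a and five heterozygotes a/b.
start = ["a/a"] * 5 + ["a/b"] * 5
rng = random.Random(1)
outcomes = [clonal_drift(start, rng)[0] for _ in range(200)]
print("replicates fixed for a/b:", outcomes.count("a/b"), "of", len(outcomes))
```

In the absence of selection the heterozygote is expected to be fixed in roughly half of the replicates, in proportion to its initial frequency, so its fixation requires no heterozygote advantage.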

The  gene  frequency  pattern  in  table  6.4  also  gives  an  insight  into  the 
reproductive  biology  of  this  organism.  In  ch.  3 we  have  seen  that  new 
mutations  are  almost  always  different  from  preexisting  alleles.  In  asexual 
diploids  each  of  the  two  gene  doses  at  a locus  mutates  independently,  so  that 
the  two  genes  will  gradually  differentiate  from  each  other  in  the  absence  of 
meiotic  mechanism  (White,  1954).  The  decline  of  electrophoretic  identity 
of  proteins  is  slower  than  that  of  protein  identity  at  the  amino  acid  level 
(Nei  and  Chakraborty,  1973),  but  after  a sufficient  period  of  evolutionary 
time  the  electrophoretic  identity  of  proteins  encoded  by  the  two  genes  must 
be  very  small.  Particularly,  L.  lucidulum  is  believed  to  be  a direct  descendant 
of  the  Devonian  stock  and  the  morphology  of  this  species  closely  resembles 
that  of  the  Devonian  fossil  species  about  300  million  years  ago.  Then,  we 
would  expect  that  the  proteins  encoded  by  the  two  allelic  genes  at  a locus 
almost  always  have  different  mobilities.  Namely,  virtually  all  plants  will  be 
heterozygous.  Table  6.4,  however,  indicates  that  this  is  not  the  case. 

This  unexpected  result  may  be  explained  by  one  of  the  following  two 
hypotheses.  The  first  is  that  this  plant  occasionally  reproduces  sexually.  In 
fact,  Levin  and  Crepet  (1973)  state  that  reproduction  may  be  accomplished 
asexually by bulbils or sexually by spores, though they believe that it is
primarily or almost exclusively asexual in practice. If there is a small
probability of sexual reproduction, the genes in different plants are eventually
recombined  and  the  existence  of  homozygotes  is  no  longer  mysterious.  The 
second  hypothesis  is  that  most  loci  are  actually  heterozygous  but  form  a 
single  electrophoretic  band  because  one  of  the  two  alleles  at  the  apparently 
homozygous  loci  is  nonfunctional  and  produces  no  protein.  Since  lethal 
genes  are  sheltered  by  asexual  reproduction,  as  will  be  discussed  later,  it  is 
possible  that  asexual  diploids  have  a large  number  of  nonfunctional  genes 
in  heterozygous  condition.  In  L.  lucidulum  perhaps  the  first  hypothesis  is 
correct,  but  in  strictly  asexual  organisms  the  second  possibility  cannot  be 
neglected. 

In  some  species  of  weevils  there  are  bisexual  and  parthenogenetic  races. 
The  parthenogenetic  races  in  Otiorrhynchus  scaber  are  triploid  or  tetraploid 
and sexually isolated from the diploid races (Soumalainen, 1969).
Soumalainen and Saura (1973) studied the protein polymorphism for over 25 loci
in these races. Their data clearly indicate that the genic variation in
parthenogenetic races is no less than that of bisexual diploid races, though no
quantitative comparison has been made. (The gene diversity for diploid races is
0.309.)  Theoretically,  as  mentioned  earlier,  the  formula  for  gene  diversity 
(6.5)  can  be  used  for  any  organism.  In  practice,  however,  it  is  not  easy  to 
determine  gene  frequencies  for  protein  loci  in  triploids  or  tetraploids,  since 
the  gene  dosage  at  a locus  cannot  always  be  determined  by  electrophoresis. 
Namely, genotypes $A_1A_2A_2$ and $A_1A_1A_2$ in triploids, for instance, cannot
always  be  determined  by  the  intensity  of  electrophoretic  bands.  The  absence 
of  sexual  reproduction  prohibits  the  genetic  tests  of  such  genotypes.  Clearly, 
a more  refined  biochemical  technique  needs  to  be  developed. 

Soumalainen  and  Saura's  data,  however,  throw  some  light  on  the  origin 
of  the  triploid  and  tetraploid  races.  Soumalainen  (1961)  believes  that  they 
are  monophyletic,  that  is,  the  triploid  or  tetraploid  race  has  originated  from 
a single  diploid  individual  or  a few  closely  related  diploids.  On  the  other 
hand,  White  (1970)  favors  the  polyphyletic  origin.  If  all  the  present  triploid 
or  tetraploid  individuals  are  the  descendants  of  a single  individual  in  an 
ancestral  diploid  population,  then  all  of  them  must  have  the  same  genotype 
as  that  of  the  first  polyploid,  unless  new  mutations  occurred.  Thus,  if  the 
original  genotype  was  heterozygous  for  a particular  locus,  all  individuals 
are  expected  to  be  heterozygous  in  the  absence  of  mutation.  On  the  other 
hand,  this  would  not  happen  if  the  origin  is  polyphyletic,  since  the  probability 
that  polyploidization  occurs  many  times  in  the  same  genotype  (heterozygote) 
is very small. Soumalainen and Saura's data show that all the triploids
examined  are  heterozygous  for  the  same  pair  of  alleles  at  the  Adk-2  locus 
and  all  the  tetraploids  are  heterozygous  for  the  same  pair  of  alleles  at  the 
Acph-2,  Adk-2,  and  Tpi  loci.  This  strongly  supports  the  hypothesis  of  mono- 
phyletic  origin. 
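
The force of this argument can be illustrated with a rough calculation. The
sketch below assumes, purely for illustration, that each independent
polyploidization event starts from a diploid individual chosen at random and
that loci are independent. The heterozygote frequencies and the numbers of
events are hypothetical values, not estimates from Soumalainen and Saura's data.

```python
def prob_all_events_in_same_heterozygote(het_freqs, n_events):
    """Probability that n independent polyploidization events all start from
    individuals carrying the same heterozygous genotype at every listed locus.

    het_freqs: frequency, at each locus, of that particular heterozygote in the
               ancestral diploid population (hypothetical values).
    Assumes random choice of founders and independence among loci.
    """
    p_one_event = 1.0
    for h in het_freqs:
        p_one_event *= h
    return p_one_event ** n_events


# Hypothetical: three loci, each with the relevant heterozygote at frequency 0.3.
for n in (1, 2, 5, 10):
    p = prob_all_events_in_same_heterozygote([0.3, 0.3, 0.3], n)
    print(f"{n:2d} independent origins: probability = {p:.3e}")
```

Under these assumptions the probability falls off geometrically with the number
of independent origins, which is why uniform heterozygosity for the same pair of
alleles is hard to reconcile with many separate origins.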

There is, however, one difficulty in this hypothesis. Namely, as mentioned
earlier,  the  present  triploid  and  tetraploid  races  have  several  different  geno- 
types at  many  loci.  These  genotypic  variations  at  a locus  all  must  have 
occurred  by  mutation,  if  the  origin  is  monophyletic.  Therefore,  the  polyploid 
races  are  expected  to  have  many  alleles  different  from  those  of  diploid  races. 
In  reality,  however,  the  majority  of  the  polyploid  alleles  are  the  same  as 
those  of  diploids.  Clearly,  a more  detailed  study  is  required. 

At  any  rate,  studies  on  enzyme  polymorphism  seem  to  be  very  useful  in 
solving  various  problems  in  population  biology  and  evolution.  For  an 
additional  example,  Crozier  (1973)  studied  the  pattern  of  polymorphism  at 
the  malate  dehydrogenase-a  locus  in  the  ant  Aphaenogaster  rudis  and  found 
evidence  that  in  queens  of  this  species  both  monogamy  and  single  insemina- 
tion are  the  rule. 

3)  Selfing  organisms 

Some  plants  and  some  invertebrate  animals  reproduce  by  selfing  or  self- 
fertilization.  From  the  viewpoint  of  population  dynamics  of  genes,  selfing 
is  similar  to  asexual  reproduction.  Although  all  gametes  are  produced 
through  meiotic  division,  selfing  prohibits  the  recombination  of  mutant 
genes  which  occurs  in  different  individuals.  Just  as  in  asexual  organisms,  the 
whole  set  of  genes  in  an  individual  is  transmitted  together  to  its  offspring, 
though  at  a small  proportion  of  loci  gene  segregation  would  occur  because  of 
occasional  mutations.  In  artificially  produced  hybrid  populations,  of  course, 
a large number of genes would segregate in the first few generations but all loci
quickly  become  homozygous.  Nevertheless,  if  we  examine  a large  population 
of  selfing  organisms,  we  would  expect  a considerable  amount  of  genetic 
variability.  The  effective  size  of  a selfing  population  of  size  N is  approxi- 
mately N/2.  Therefore,  the  average  gene  diversity  for  neutral  genes  is  expected 
to  be  slightly  smaller  than  that  of  a randomly  mating  population  of  the  same 
size.  Since  recombination  of  genes  is  virtually  absent,  alleles  at  different  loci 
are  expected  to  be  generally  in  linkage  disequilibrium.  There  is,  however, 
one  important  difference  between  asexual  and  selfing  organisms.  Namely, 
in  strictly  asexual  diploids  or  polyploids,  all  individuals  are  expected  to  be 
eventually  heterozygous  for  all  loci,  while  in  strictly  selfing  organisms  vir- 
tually all individuals will be homozygous for most of the loci. In practice,
of  course,  most  self-fertilizing  organisms  exercise  a small  amount  of  out- 
breeding. 
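
How quickly loci become homozygous under strict selfing can be sketched
numerically. A selfed heterozygote leaves, on average, half heterozygous
offspring, so the expected heterozygosity at a locus is halved every
generation. The starting value below (every individual heterozygous, as in an
artificial hybrid population) is a hypothetical choice for illustration.

```python
def heterozygosity_under_selfing(h0, generations):
    """Expected heterozygosity at a locus in each generation of strict selfing.

    Each generation of self-fertilization halves the expected proportion of
    heterozygotes, so H_t = h0 * (1/2)**t.
    """
    return [h0 * 0.5 ** t for t in range(generations + 1)]


# Hypothetical starting point: all individuals heterozygous (h0 = 1.0).
trajectory = heterozygosity_under_selfing(1.0, 10)
for t, h in enumerate(trajectory):
    print(f"generation {t:2d}: expected heterozygosity = {h:.4f}")
```

A small amount of outbreeding, as mentioned above, would slow this decay; the
sketch does not model it.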

An  extensive  study  on  the  protein  polymorphisms  in  self-fertilizing  plants, 
Avena fatua and A. barbata (wild oats), has been made by Allard and his
associates. As expected, all natural populations of these species are polymorphic
at least for some loci. The proportion of polymorphic loci has been
estimated to be 54 percent in A. fatua and 31 percent in A. barbata (Marshall
and  Allard,  1970a),  though  this  is  based  on  a tentative  identification  of  gene 
loci.  Reliable  estimates  of  the  average  gene  diversity  per  locus  for  these 
plants  have  not  yet  been  obtained,  but  this  quantity  seems  to  vary  consider- 
ably from  location  to  location.  Hamrick  and  Allard  (1972)  studied  the 
average  gene  diversity  per  locus  for  five  enzyme  loci  (three  esterases,  one 
phosphatase,  and  one  anodal  peroxidase)  in  eleven  different  locations  (near 
Calistoga,  California),  seven  of  which  are  separated  from  each  other  by 
spaces of only about 3 to 38 meters. The average gene diversity, which they
called the polymorphic index, varied from 0 to 0.421. They related this variation
of  gene  diversity  to  the  degree  of  aridity  of  environment.  However,  some 
part  of  the  variation  must  be  due  to  genetic  drift.  In  selfing  plants  seeds  are 
generally  less  well  mixed  in  the  process  of  reproduction  than  gametes  in 
outbreeding  organisms  and  thus  the  effective  population  size  appears  to  be 
relatively  small. 

A striking  bottleneck  effect  in  a self-fertilizing  land  snail,  Rumina  decollata, 
was  recently  reported.  This  organism  was  apparently  introduced  from  Europe 
before  1822  and  is  now  distributed  throughout  the  southern  part  of  North 
America.  Selander  and  Kaufman  (1973b)  studied  the  genetic  variability  at 
25  enzyme  loci  in  California,  Arizona,  South  Carolina,  and  Texas,  but  found 
no  polymorphism,  all  the  individuals  in  these  areas  being  of  the  same  single 
genotype.  On  the  other  hand,  the  populations  in  southern  France  and 
northern  Africa  had  many  different  alleles,  though  the  genetic  variability 
within  populations  was  virtually  absent.  As  Selander  and  Kaufman  con- 
cluded, the  absence  of  genetic  variability  in  the  North  American  population 
is  clearly  due  to  the  fact  that  this  population  was  descended  rather  recently 
from  a single  population  somewhere  in  southern  France,  which  was,  in 
turn,  derived  from  a single  ancestral  individual.  It  is  interesting  to  note  that 
a population  can  colonize  a new  territory  successfully  without  much  genetic 
variability  at  the  enzyme  level. 

6.3.2 Blood groups and other loci

1) Red cell antigens

Blood groups, which are distinguished by red cell antigens, are the earliest
genetic  polymorphisms  discovered  in  natural  populations.  In  man  more  than 
100  red  cell  antigens  have  been  identified,  though  we  do  not  know  what 
proportion  of  the  human  genome  is  concerned  with  blood  cell  antigens. 
These  antigens  are  found  only  when  there  are  polymorphic  or  variant 
antigens  in  the  same  blood  group  system.  Almost  the  same  degree  of  poly- 
morphism in  blood  groups  is  believed  to  exist  in  other  mammalian  species 
(Race  and  Sanger,  1968).  In  man  there  is  a large  amount  of  data  on  blood 
group  gene  frequencies  in  various  populations.  The  average  heterozygosity 
per  locus  was  thus  computed  for  the  three  major  races  of  man.  The  results 
are  given  in  table  6.5.  The  average  heterozygosity  clearly  depends  on  the 
number  of  loci  studied;  it  is  higher  when  the  number  of  loci  used  is  small 
than  when  this  is  large.  This  is  because  the  discovery  of  a polymorphic  locus 
is  easier  than  that  of  a monomorphic  one  in  blood  groups  (Lewontin,  1967). 
From  a study  of  the  change  of  the  cumulative  average  gene  diversity  over 
the  year  of  discovery,  Nei  and  Roychoudhury  (1974b)  have  concluded  that 
the heterozygosities for Negroids and Mongoloids are probably overestimates,
while that for Caucasoids appears to be close to the actual value. Thus,
about 13 percent of blood group loci in an individual appear to be heterozygous
on the average.
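
The two summary statistics used here, and the bias that arises when polymorphic
loci are discovered preferentially, can be sketched as follows. The allele
frequencies are hypothetical (they are not taken from table 6.5), and the 0.99
cutoff for calling a locus polymorphic is an assumed convention.

```python
def gene_diversity(allele_freqs):
    """Gene diversity (expected heterozygosity) at one locus: 1 - sum(x_k^2)."""
    return 1.0 - sum(x * x for x in allele_freqs)


def summarize_loci(loci, criterion=0.99):
    """Return (P, H) for a list of loci, each given as a list of allele frequencies.

    P: proportion of loci whose most common allele has frequency <= criterion
       (assumed convention).
    H: average gene diversity over all loci, monomorphic ones included.
    """
    n = len(loci)
    p = sum(1 for freqs in loci if max(freqs) <= criterion) / n
    h = sum(gene_diversity(freqs) for freqs in loci) / n
    return p, h


# Hypothetical allele frequencies at four loci.
loci = [[1.0], [0.5, 0.5], [0.9, 0.1], [1.0]]
P_all, H_all = summarize_loci(loci)

# Dropping the monomorphic loci mimics a sample biased toward polymorphic loci.
polymorphic_only = [freqs for freqs in loci if max(freqs) < 1.0]
P_biased, H_biased = summarize_loci(polymorphic_only)

print(f"all loci:         P = {P_all:.2f}, H = {H_all:.3f}")
print(f"polymorphic only: P = {P_biased:.2f}, H = {H_biased:.3f}")
```

In this toy example the average doubles when the monomorphic loci are left out,
which is the direction of bias described for small sets of blood group loci.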

This  estimate  is  close  to  that  for  protein  loci  but  this  does  not  mean  that 
the  gene  diversity  at  the  codon  level  is  the  same  for  the  two  kinds  of  gene  loci. 
The relationship between the immunological reaction and the gene is still
not well understood. Blood group substances or antigens are usually components
of the red cell membrane, and apparently many of them are not
proteins. For example, the substances that confer the immunological
specificities in the ABO and Lewis blood group systems are carbohydrate in
nature. Presumably, the blood group genes code for some specific proteins
which themselves have enzymatic properties or which control enzymes
involved in the synthesis of nonprotein blood group substances (Watkins,
1967). Of course, if there is any genetic difference in blood group substance,
there must be at least one amino acid difference between the proteins
controlling the different blood group substances, but it is not known
whether all amino acid differences between the proteins are reflected as
antigenic differences or not. It is also often difficult to decide whether a
group of closely associated antigens are controlled by one locus or by multiple
loci, since the proteins coded for by blood group genes are not known.


Table 6.5

Proportion of polymorphic loci (P) and average heterozygosity (H) (gene diversity) for
blood group loci in the three major races of man. From Nei and Roychoudhury (1974b).

No. of loci used    Caucasoid             Negroid               Mongoloid
                    P      H              P      H              P      H
a) 37               0.57   [illegible]
b) 34               0.56   [illegible]    0.44   0.162
c) 21               0.71   0.264          0.62   0.118          0.62   0.242

a) All loci for Caucasoids; b) Common loci for Caucasoids and Negroids; c) Common
loci for the three major races.

2)  White  cell  antigens 

The  antigens  in  blood  are  not  confined  to  the  red  cell  but  also  occur  in  the 
white  cell.  The  best-known  example  is  the  histocompatibility  antigens  which 
control  skin  graft  compatibility.  If  the  recipient  of  a skin  graft  has  the  same 
antigens  or  at  least  all  the  antigens  carried  by  the  donor,  the  skin  is  accepted, 
i.e.  the  graft  is  compatible,  but  otherwise  the  skin  is  rejected,  i.e.  the  graft 
is  incompatible.  One  of  the  main  determinants  of  these  histocompatibility 
antigens  is  the  white  cell  HL-A  system  in  man.  The  H2  system  in  mice  is 
also located on the white cells (leukocytes). The genetics of the HL-A system
is  very  complicated,  but  at  present  it  is  believed  that  this  consists  of  two 
major  series  of  antigens,  LA  and  4,  each  of  which  behaves  as  if  its  constituent 
antigens  were  controlled  by  a set  of  alleles  at  a single  locus  (Bodmer,  1972). 
The  LA  and  4 loci  (regions)  appear  to  be  closely  linked.  The  H2  system  in 
mice  seems  to  be  homologous  to  the  HL-A  system  in  man  and  can  be 
separated  into  two  series,  the  D and  K loci  (regions).  There  are  at  least  nine 
different  alleles  at  the  LA  locus  and  14  different  alleles  at  the  4 locus  all  of 
which  have  a frequency  equal  to  or  higher  than  0.01  in  Caucasian  popula- 
tions (Bodmer,  1972).  The  heterozygosity  or  gene  diversity  has  been  estimated 
to  be  0.82  and  0.90  for  the  LA  and  4 loci,  respectively.  These  values  are 
much higher than those for protein or blood group loci. In practice, however,
it is not known whether the LA or 4 locus antigens all represent true
alleles at the same cistron or pseudoalleles at multiple cistron loci. If they
are pseudoalleles, the above estimates of heterozygosity do not refer to the
genic variation at a locus, and the average heterozygosity per locus would be
reduced drastically. Bodmer speculates from the recombination data between
the  LA  and  4 loci  that  if  the  whole  chromosome  segment  between  the  two 
loci  is  concerned  with  the  HL-A  antigen  formation,  at  least  hundreds  of 
cistrons  are  involved.  Clearly,  more  detail  of  the  molecular  biology  of  these 
antigens  and  their  genes  should  be  known  before  any  meaningful  study  on 
the  population  genetics  of  these  loci  can  be  made.  There  are  some  other 
antigenic  polymorphisms  detected  on  white  blood  cells  such  as  the  5,  NA, 
and  Zw  systems.  The  genetics  of  these  systems  is  less  complicated  than  the 
HL-A  system  and  similar  to  that  of  the  red  blood  cells  (Cavalli-Sforza  and 
Bodmer,  1971). 

3)  Immunoglobulins 

Immunoglobulins are the antibody substances which are formed in lymphocytes
in reaction to antigenic foreign materials such as viruses and bacteria
in vertebrate organisms. The immunoglobulin molecule is composed of two
identical heavy chains and two identical light chains of polypeptides with a
different amount of carbohydrate attached. In man there are five different
classes of immunoglobulins which can be distinguished according to their
overall molecular structure and physiological properties. Most of the
immunoglobulins produced in man belong to the class IgG. The light chains
(composed of about 220 amino acids) can be classified into two types, κ- and
λ-chains, while the heavy chains (composed of about 400 amino acids) into
five types, γ-, α-, μ-, δ-, and ε-chains (see table 6.6). Each class of immunoglobulin
contains a characteristic type of heavy chain. Thus, the five classes
of immunoglobulins IgG, IgA, IgM, IgD, and IgE have the γ, α, μ, δ, and
ε heavy chains, respectively. On the other hand, the light chains, κ and λ,
occur in all classes of immunoglobulins. Therefore, IgG, for example, has
either the molecular form κ2γ2 or λ2γ2. A further complication is that each
of the light and heavy chains is composed of constant and variable regions.
The constant region has the same amino acid sequence for a variety of
antigens, while the amino acid sequence of the variable region varies with
each different kind of antibody.


Table 6.6

Human immunoglobulin chains. From Gally and Edelman (1972).

                        Light chains                Heavy chains
Designation             κ            λ              γ          α            μ        δ        ε
                        (kappa)      (lambda)       (gamma)    (alpha)      (mu)     (delta)  (epsilon)
Classes in which        All classes  All classes    IgG        IgA          IgM      IgD      IgE
chains occur
Isotypic or sub-class   None         Oz+, Oz-,      1-4        [illegible]  1, 2
variants                             Kern+, Kern-
Allotypic variants      InV 1, 2, 3                 Gm (1-23)  Am1, Am2     -        -        -
Molecular weight        22,000       22,000         50,000     50,000       58,000   56,000   61,000
Variable region         VκI-VκIII    VλI-VλV        VHI-VHIII
sub-groups

The  genetic  control  of  immunoglobulin  synthesis  is  one  of  the  most 
fascinating  subjects  in  current  eukaryote  genetics  and  an  intensive  study  is 
now underway. Yet the details of the control still remain to be clarified.
Excellent reviews of the current status of immunogenetics have been given by
Gally and Edelman (1972) and Grubb (1971). For our purpose, only a brief
account is sufficient. It is now generally accepted that in man there are at
least four closely linked loci which control the constant region of the γ-chain
(Cγ1, Cγ2, Cγ3, Cγ4), while there are at least two loci which code for the
variable region of each polypeptide chain. All the genes controlling these
polypeptides  seem  to  have  evolved  from  a single  ancestral  gene  by  gene 
duplication. 

At least at several of the immunoglobulin loci there is genetic polymorphism
in the same population. The best-known polymorphisms
in man are the InV and Gm systems, which are due to the allelic variation
in the constant regions of the κ- and γ-chains, respectively. It is known that
these two systems are inherited independently of each other. In practice,
however, the polymorphisms in these loci are studied by immunological
methods rather than by amino acid sequencing of the immunoglobulins.
Therefore, the relationship between the immunological 'factor' and gene
structure is not well known except in some special cases. The difference
between the InV factors InV(-1, -2) and InV(1, 2) corresponds to a single
amino acid interchange of valine and leucine at position 191 of the κ-chain.
Also, several of the Gm factors have been correlated to one of the four loci
responsible for the γ-chains.

At  any  rate,  by  the  immunological  method  three  different  factors  for  the 
InV system and 23 for the Gm system have been identified in man. These
factors do not represent true allelic differences but probably constitute
pseudoalleles as in the case of the Rh locus. The population frequencies of
these factors have been studied extensively, and a large amount of polymorphism
has been discovered. Just as with the histocompatibility loci, however,
we cannot determine the level of gene diversity for these loci, since a
locus cannot be clearly defined by immunological methods. Nevertheless,
the recent progress in immunogenetics has made one thing clear: The high
degree of heterogeneity in immunoglobulins in vertebrates is apparently
controlled by a relatively small number of genes in the genome, not by a large
number. How such a system evolved is not well understood at the present
time  (cf.  ch.  8). 


6.4  Gene  diversity  in  subdivided  populations 

In  the  foregoing  section  we  discussed  the  gene  diversity  within  populations. 
Natural populations are, however, generally divided into a number of subpopulations.
It is therefore desirable to study the gene diversities within and
between populations. The analysis of gene diversity in the total population
into  its  components  can  be  made  by  the  following  method,  which  is  applicable 
to  any  organism,  whether  it  is  sexually  or  asexually  reproducing  or  whether 
it  is  diploid  or  nondiploid,  as  far  as  gene  frequencies  can  be  determined 
(Nei,  1973c).  It  is  also  applicable  to  any  situation  without  regard  to  the 
number  of  alleles  per  locus  and  the  pattern  of  evolutionary  forces  such  as 
mutation, selection, and migration. It is different from Wright's (1943, 1951,
1965) method of F-statistics, which is intended to measure the deviations of
genotype frequencies from Hardy-Weinberg proportions. It is also different
from  Cockerham's  (1973)  analysis  of  gene  frequencies,  which  is  essentially 
the  same  as  the  method  of  F-statistics.  Our  measures  of  gene  diversity  are 
not  related  to  genotype  frequencies  except  in  randomly  mating  populations. 
In  other  words,  we  disregard  the  distribution  of  genotype  frequencies  within 
populations. 

The  following  theory  is  intended  to  be  applied  to  the  average  gene  diversity 
for  a large  number  of  loci,  but  for  simplicity  we  consider  a single  locus. 
The  results  obtained  are  directly  applicable  to  the  average  gene  diversity. 
For  this  reason,  we  shall  use  the  notations  for  the  average  gene  diversity  and 
identity  rather  than  those  for  a single  locus. 

Consider a population which is subdivided into $s$ subpopulations. Let $x_{ik}$
be the frequency of the $k$-th allele in the $i$-th subpopulation. The gene
identity (1 - gene diversity) in this subpopulation is given by
$J_i = \sum_k x_{ik}^2$, while the gene identity in the total population is

\[ J_T = \sum_k x_{\cdot k}^2, \tag{6.7} \]

where $x_{\cdot k} = \sum_i x_{ik}/s$. The quantity $J_T$ may be written as

\[
J_T = \Bigl( \sum_i \sum_k x_{ik}^2 + \sum_{i \ne j} \sum_k x_{ik} x_{jk} \Bigr) \Big/ s^2
    = \Bigl( \sum_i J_i + \sum_{i \ne j} J_{ij} \Bigr) \Big/ s^2,
\]

where $J_{ij} = \sum_k x_{ik} x_{jk}$ is the gene identity between the $i$-th and
$j$-th subpopulations.

Let us now define the gene diversity between the $i$-th and $j$-th populations
as

\[
D_{ij} = H_{ij} - (H_i + H_j)/2
       = (J_i + J_j)/2 - J_{ij}, \tag{6.8}
\]

where $H_i = 1 - J_i$ and $H_{ij} = 1 - J_{ij}$. This quantity is identical to the
minimum estimate of net codon differences between two populations, which
will be defined in the next chapter (7.1). Note that $D_{ij}$ is
$\sum_k (x_{ik} - x_{jk})^2/2$, so that it is nonnegative. If we use (6.8) and note
that $D_{ii} = 0$, $J_T$ reduces to

\[
J_T = \sum_i J_i / s - \sum_i \sum_j D_{ij} / s^2
    = J_S - D_{ST},
\]

where $J_S$ is the average gene identity within subpopulations, and $D_{ST}$ is
the average gene diversity between subpopulations, including the comparisons
of subpopulations with themselves. The gene diversity in the total
population ($H_T = 1 - J_T$) is

\[ H_T = H_S + D_{ST}, \tag{6.9} \]

where $H_S = 1 - J_S$. Thus, the gene diversity in the total population can be
analyzed into the gene diversities within and between subpopulations. As
mentioned earlier, the above formula holds true for the average gene
diversity for any number of loci. In fact, in order to obtain a general picture
of gene differentiation among subpopulations, a large number of loci which
are a random sample of the genome should be used, including both polymorphic
and monomorphic loci.
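
The decomposition can be checked numerically. The sketch below assumes equally
weighted subpopulations and a single locus; the allele frequencies are
hypothetical. The between-subpopulation component is computed from the pairwise
gene diversities rather than by subtraction, so that formula (6.9) is actually
tested.

```python
def identity(x, y):
    """Gene identity between two allele-frequency vectors: sum_k x_k * y_k."""
    return sum(a * b for a, b in zip(x, y))


def diversity_components(freqs):
    """Decompose gene diversity for equally weighted subpopulations.

    freqs[i][k] is the frequency of allele k in subpopulation i.
    Returns (H_T, H_S, D_ST), with D_ST computed from the pairwise D_ij.
    """
    s = len(freqs)
    n_alleles = len(freqs[0])
    mean = [sum(row[k] for row in freqs) / s for k in range(n_alleles)]

    h_t = 1.0 - identity(mean, mean)                           # H_T = 1 - J_T
    h_s = 1.0 - sum(identity(row, row) for row in freqs) / s   # H_S = 1 - J_S

    # D_ij = (J_i + J_j)/2 - J_ij, averaged over all s*s ordered pairs,
    # including the comparisons of subpopulations with themselves (D_ii = 0).
    d_st = sum(
        (identity(a, a) + identity(b, b)) / 2.0 - identity(a, b)
        for a in freqs
        for b in freqs
    ) / s ** 2
    return h_t, h_s, d_st


# Hypothetical frequencies of three alleles in three subpopulations.
freqs = [
    [0.9, 0.1, 0.0],
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
]
H_T, H_S, D_ST = diversity_components(freqs)
print(f"H_T = {H_T:.4f}, H_S = {H_S:.4f}, D_ST = {D_ST:.4f}")
print(f"H_S + D_ST = {H_S + D_ST:.4f}")
```

For an average over many loci, the same function could be applied locus by locus
and the three components averaged.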

The relative magnitude of gene differentiation among subpopulations may
be measured by

\[ G_{ST} = D_{ST}/H_T. \tag{6.10} \]

This varies from 0 to 1 and will be called the coefficient of gene differentiation.
A formula for the approximate sampling variance of $G_{ST}$ has been given by
Chakraborty (1974).

From (6.9) and (6.10) we obtain the equation

\[ (1 - G_{ST})(1 - J_T) = 1 - J_S. \tag{6.11} \]

This is different from Wright's well-known formula
$1 - F_{IT} = (1 - F_{IS})(1 - F_{ST})$, where $F_{IT}$ and $F_{IS}$ are the correlations
between two uniting gametes to produce the individuals relative to the total
population and relative to the subpopulations, respectively, while $F_{ST}$ is the
correlation between two gametes drawn at random from each subpopulation. The
difference occurs because $F_{IS}$ and $F_{IT}$ measure the deviations of genotype
frequencies from Hardy-Weinberg proportions, while $J_S$ and $J_T$ are gene
identities. Note also that $F_{IS}$ and $F_{IT}$ may become negative but $J_S$ and $J_T$
are nonnegative. On the other hand, $G_{ST}$ is equivalent to $F_{ST}$, which never
becomes negative. In fact, $G_{ST}$ is identical to $F_{ST}$ in (5.9) if there are only
two alleles at a locus, since in this case $D_{ST} = 2V_x$ and $H_T = 2\bar{x}(1 - \bar{x})$,
where $\bar{x}$ and $V_x$ are the mean and variance of the frequency of an allele.
Furthermore, Wright (personal communication) has shown that in the
presence of multiple alleles $G_{ST}$ is equal to a weighted mean of $F_{ST}$ for all
alleles, i.e.,

\[ F_{ST} = \sum_i \bar{x}_i (1 - \bar{x}_i) F_{ST(i)} \Big/ \sum_i \bar{x}_i (1 - \bar{x}_i), \]

where $i$ refers to the $i$-th allele. Thus, $G_{ST}$ is regarded as an extension of $F_{ST}$.
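
Both statements can be verified numerically. The sketch below assumes equally
weighted subpopulations and takes the per-allele F_ST to be the usual ratio
V_x / [x(1 - x)] of the variance of an allele's frequency to its binomial
maximum, which is what (5.9) is assumed to state. The frequencies are
hypothetical.

```python
def mean_and_variance(values):
    """Mean and (population) variance of a list of numbers."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v


def g_st(freqs):
    """G_ST = D_ST / H_T, with D_ST = H_T - H_S (equal subpopulation weights)."""
    s = len(freqs)
    n_alleles = len(freqs[0])
    mean = [sum(row[k] for row in freqs) / s for k in range(n_alleles)]
    h_t = 1.0 - sum(x * x for x in mean)
    h_s = 1.0 - sum(x * x for row in freqs for x in row) / s
    return (h_t - h_s) / h_t


def weighted_mean_f_st(freqs):
    """Mean of per-allele F_ST(i) = V_i / [xbar_i (1 - xbar_i)],
    weighted by xbar_i (1 - xbar_i)."""
    num = den = 0.0
    for k in range(len(freqs[0])):
        xbar, var = mean_and_variance([row[k] for row in freqs])
        weight = xbar * (1.0 - xbar)
        if weight > 0.0:          # skip alleles that are absent or fixed everywhere
            num += weight * (var / weight)
            den += weight
    return num / den


# Two alleles: G_ST should equal F_ST = V_x / [xbar (1 - xbar)].
two_alleles = [[0.2, 0.8], [0.5, 0.5], [0.8, 0.2]]
xbar, v_x = mean_and_variance([row[0] for row in two_alleles])
f_st_two = v_x / (xbar * (1.0 - xbar))
g_st_two = g_st(two_alleles)

# Three alleles: G_ST should equal the weighted mean of the per-allele F_ST.
three_alleles = [[0.9, 0.1, 0.0], [0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
g_st_three = g_st(three_alleles)
f_st_three = weighted_mean_f_st(three_alleles)

print(f"two alleles:   G_ST = {g_st_two:.4f}, F_ST = {f_st_two:.4f}")
print(f"three alleles: G_ST = {g_st_three:.4f}, weighted mean F_ST = {f_st_three:.4f}")
```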

Although $G_{ST}$ is a good measure of the relative degree of gene differentiation
among subpopulations, it is highly dependent on the value of $H_T$.
When this is small, $G_{ST}$ may be large even if the absolute gene differentiation
is small. The absolute degree of gene differentiation may be measured by

\[
D_m = s D_{ST}/(s - 1)
    = \sum_{i \ne j} D_{ij} \big/ [s(s - 1)]. \tag{6.12}
\]

This measure is an estimate of minimum net codon differences between
populations and independent of the gene diversity within subpopulations,
and thus it can be used for comparing the degrees of gene differentiation in
different organisms. $D_m$ may also be used to compute the interpopulational
gene diversity relative to the intrapopulational gene diversity. That is,

\[ R_{ST} = D_m/H_S. \tag{6.13} \]
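
The two forms of the absolute measure, and the ratio built from it, can be
compared directly. This sketch assumes equally weighted subpopulations and one
locus with hypothetical allele frequencies; the pairwise diversity is taken as
half the sum of squared frequency differences, as noted for (6.8).

```python
from itertools import permutations


def pairwise_d(x, y):
    """D_ij = sum_k (x_ik - x_jk)^2 / 2 for two allele-frequency vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / 2.0


def absolute_differentiation(freqs):
    """Return (D_m via D_ST, D_m via pairwise average, R_ST)."""
    s = len(freqs)
    n_alleles = len(freqs[0])
    mean = [sum(row[k] for row in freqs) / s for k in range(n_alleles)]
    h_t = 1.0 - sum(x * x for x in mean)
    h_s = 1.0 - sum(x * x for row in freqs for x in row) / s
    d_st = h_t - h_s

    d_m_from_dst = s * d_st / (s - 1)
    # Average of D_ij over the s(s - 1) ordered pairs with i != j.
    d_m_from_pairs = sum(
        pairwise_d(a, b) for a, b in permutations(freqs, 2)
    ) / (s * (s - 1))
    r_st = d_m_from_dst / h_s
    return d_m_from_dst, d_m_from_pairs, r_st


# Hypothetical frequencies of three alleles in three subpopulations.
freqs = [
    [0.9, 0.1, 0.0],
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
]
d_m_a, d_m_b, r_st = absolute_differentiation(freqs)
print(f"D_m = {d_m_a:.4f} (from D_ST), {d_m_b:.4f} (from pairs), R_ST = {r_st:.4f}")
```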

Formula (6.9) can easily be extended to the case where each subpopulation
is further subdivided into a number of colonies. In this case $H_S$ may be
analyzed into the gene diversities within and between colonies ($H_C$ and
$D_{CS}$, respectively). Therefore,

\[ H_T = H_C + D_{CS} + D_{ST}. \tag{6.14} \]

This sort of analysis can be continued to any degree of hierarchical subdivision.
The relative degree of gene differentiation attributable to colonies
within subpopulations can be measured by $G_{CS(T)} = D_{CS}/H_T$. It can also
be shown that $(1 - G_{CS})(1 - G_{ST})H_T = H_C$, where $G_{CS} = D_{CS}/H_S$.
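
The last identity follows directly from the definitions. A short derivation,
using only the decomposition of the total gene diversity into within- and
between-subpopulation parts, the further decomposition of the within-subpopulation
part into within- and between-colony parts, and the two ratios just defined, is:

```latex
\begin{align*}
(1 - G_{ST})\,H_T &= H_T - D_{ST} = H_S, \\
(1 - G_{CS})\,H_S &= H_S - D_{CS} = H_C, \\
\text{so that}\quad (1 - G_{CS})(1 - G_{ST})\,H_T &= (1 - G_{CS})\,H_S = H_C .
\end{align*}
```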

The above method has been applied to various organisms (table 6.7). The
estimates of $H_T$, $H_S$, $G_{ST}$, and $D_m$ for the three major races of man,
Caucasoids, Negroids, and Mongoloids, were obtained from the 35 common
protein loci used in estimating the gene diversity per locus for each major
race. Using the mean gene frequency of each allele at the 35 loci for the three
races, we obtain $H_T = 0.130$, while the estimate of $H_S$ is 0.121, which is equal
to the mean of the three gene diversity estimates in table 6.1. Therefore,
$D_{ST} = 0.009$ and $D_m = 3D_{ST}/2 = 0.014$. Namely, the minimum net codon
differences between the three races are estimated to be 0.014 per locus.
On the other hand, the estimate of $G_{ST}$ is 0.070, so that only 7 percent of the
total gene diversity is attributable to the gene differences between races.


Table 6.7

Analysis of gene diversity and degree of gene differentiation among local populations of
various organisms.

Population                                     No. of loci   H_T     H_S     G_ST    D_m
Man - 3 major races (a)                        35            0.130   0.121   0.070   0.014
Yanomama Indians - 37 villages (b)             15            0.039   0.036   0.069   0.003
House mouse - 4 populations (c)                40            0.097   0.086   0.119   0.015
Dipodomys ordii - 9 populations (d)            18            0.037   0.012   0.674   0.028
Drosophila equinoxialis - 5 populations (e)    27            0.201   0.179   0.109   0.026
Horseshoe crab - 4 populations (f)             25            0.066   0.061   0.072   0.006
Lycopodium lucidulum - 4 populations (g)       13            0.071   0.051   0.284   0.027

(a) Nei and Roychoudhury (1974b); (b) Weitkamp et al. (1972), Weitkamp and Neel (1972);
(c) Selander et al. (1969); (d) Johnson and Selander (1971); (e) Ayala et al. (1974);
(f) Selander et al. (1970); (g) Levin and Crepet (1973).

Table 6.7 indicates that both $H_T$ and $H_S$ vary considerably with organism.
The value of $G_{ST}$ also varies. In man and the horseshoe crab it is about 0.07,
but in Dipodomys ordii $G_{ST}$ is as high as 0.69, so that about 70 percent of
genic variation in the total population is due to interpopulational gene
differences. The large value of $G_{ST}$ in D. ordii is, however, due to the small
value of $H_S$ in this organism, and $D_m$, the absolute measure of gene differentiation,
is not so large. In terms of $D_m$, the gene differences between local
populations seem to be about 0.03 or less in most organisms.
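
The derived columns of table 6.7 can be recomputed from the total and
within-subpopulation gene diversities and the number of subpopulations. The
sketch below does this with the definitions of the relative and absolute
measures given above. Because the tabulated inputs are rounded to three
decimals, the recomputed values are expected to agree with the published ones
only approximately.

```python
# (population, s, H_T, H_S, published G_ST, published D_m), as listed in table 6.7
TABLE_6_7 = [
    ("Man, 3 major races",                   3, 0.130, 0.121, 0.070, 0.014),
    ("Yanomama Indians, 37 villages",       37, 0.039, 0.036, 0.069, 0.003),
    ("House mouse, 4 populations",           4, 0.097, 0.086, 0.119, 0.015),
    ("Dipodomys ordii, 9 populations",       9, 0.037, 0.012, 0.674, 0.028),
    ("Drosophila equinoxialis, 5 pops",      5, 0.201, 0.179, 0.109, 0.026),
    ("Horseshoe crab, 4 populations",        4, 0.066, 0.061, 0.072, 0.006),
    ("Lycopodium lucidulum, 4 populations",  4, 0.071, 0.051, 0.284, 0.027),
]


def recompute(s, h_t, h_s):
    """Return (D_ST, G_ST, D_m) from H_T, H_S and the number of subpopulations s."""
    d_st = h_t - h_s              # between-subpopulation gene diversity
    g_st = d_st / h_t             # relative differentiation
    d_m = s * d_st / (s - 1)      # absolute differentiation
    return d_st, g_st, d_m


recomputed = {name: recompute(s, h_t, h_s) for name, s, h_t, h_s, _, _ in TABLE_6_7}
for name, s, h_t, h_s, g_pub, d_pub in TABLE_6_7:
    d_st, g_st, d_m = recomputed[name]
    print(f"{name:38s} G_ST {g_st:.3f} (table {g_pub:.3f})  "
          f"D_m {d_m:.3f} (table {d_pub:.3f})")
```

The Dipodomys ordii row shows the point made in the text: the relative measure
is large mainly because the within-population diversity is small, while the
absolute measure is of the same order as in the other organisms.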

When  there  is  more  than  one  level  of  hierarchical  subdivisions,  one  might 
ask  how  the  genic  variation  is  apportioned  within  and  between  them.  For 
example,  the  world  population  of  man  can  be  divided  into  several  races  and 
each  race  can  further  be  subdivided  into  several  populations.  Lewontin 
(1972)  studied  the  apportionment  of  genic  variation  within  and  between 
these  subdivisions  by  using  the  Shannon  information  measure.  He  divided 
the  total  human  population  into  seven  races,  Caucasians,  Africans,  Mongo- 
loids, South  Asian  Aborigines,  Amerinds,  Oceanians,  and  Australian 
Aborigines,  each  race  consisting  of  several  populations.  The  gene  frequency 
data  used  are  those  of  17  polymorphic  loci  (mostly  blood  groups).  His 
result  is:  About  86  percent  of  the  total  genic  variation  in  man  exists  within 
populations,  about  8 percent  between  populations  within  races,  and  only 
about  6 percent  between  races.  Although  the  Shannon  information  measure 
is  not  related  to  any  genetic  entity,  it  is  expected  that  a similar  conclusion 
will  be  obtained  by  the  analysis  of  gene  diversity  in  this  case.  In  fact,  this 
result  is  virtually  the  same  as  our  earlier  conclusion  about  racial  gene 
differences  in  man  (see  also  Nei  and  Roychoudhury,  1972). 
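
A three-level apportionment of this kind can be sketched with the Shannon
information measure as follows. This is only an illustration of the idea: the
equal weighting of populations and races is an assumption (the weighting
Lewontin actually used is not described here), and the allele frequencies and
group names are hypothetical.

```python
import math


def shannon(freqs):
    """Shannon information measure, -sum p*log2(p), for one locus."""
    return -sum(p * math.log2(p) for p in freqs if p > 0.0)


def mean_freqs(rows):
    """Unweighted mean allele frequencies over a list of frequency vectors."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]


def apportion(races):
    """Split total diversity into three shares (fractions of the total):
    within populations, between populations within races, between races."""
    groups = list(races.values())
    # Average within-population diversity, first within each race, then over races.
    h_pop = sum(
        sum(shannon(p) for p in pops) / len(pops) for pops in groups
    ) / len(groups)
    race_means = [mean_freqs(pops) for pops in groups]
    h_race = sum(shannon(m) for m in race_means) / len(race_means)
    h_total = shannon(mean_freqs(race_means))
    return (
        h_pop / h_total,
        (h_race - h_pop) / h_total,
        (h_total - h_race) / h_total,
    )


# Hypothetical two-allele locus in two races with two and three populations.
races = {
    "race_1": [[0.6, 0.4], [0.7, 0.3]],
    "race_2": [[0.5, 0.5], [0.4, 0.6], [0.45, 0.55]],
}
within, between_pops, between_races = apportion(races)
print(f"within populations:                {within:.3f}")
print(f"between populations within races:  {between_pops:.3f}")
print(f"between races:                     {between_races:.3f}")
```

Because the Shannon measure is concave, each share is nonnegative and the three
shares sum to one under this weighting.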

Another  example  of  the  apportionment  of  genic  variation  within  and 
between hierarchies has been provided by Roychoudhury (unpublished),
who analyzed the gene diversity in the American Indian population into the
gene diversities within ($H_C$) and between ($D_{CS}$) villages within tribes and
between tribes ($D_{ST}$) by using formula (6.14). In this study only three tribes
(Papago, Makiritare, and Yanomama) were used, so that the results obtained
may not apply to the whole American Indian population. Of the 13
loci used, 11 were blood group loci. Altogether, 11 loci were polymorphic
in at least one of the three tribes. Thus, the loci used clearly deviated
from a random sample of the genome. The results of gene diversity analysis
in each tribe are given in table 6.8. It is seen that $H_S$ is more or less the same
for the three tribes but the $D_{CS}$ value in Papago is about half those of
Makiritare and Yanomama. By using the unweighted mean gene frequencies for
each tribe, we can estimate $H_T$, which becomes 0.316. On the other hand, the
estimates of $H_C$ and $D_{CS}$ are 0.278 and 0.013, respectively. Therefore, $D_{ST}$ is
estimated to be 0.024. Thus, 88 percent ($H_C/H_T$) of the gene diversity in the
American Indian population exists within villages, while the gene diversities
between villages within tribes ($D_{CS}/H_T$) and between tribes ($D_{ST}/H_T$) are
about 4 and 8 percent, respectively. This result confirms and extends
Lewontin's conclusion that a large part of the genic variation in man exists
within small units of populations and the interpopulational gene variation
is rather small. Table 6.7 indicates that this conclusion holds also for other
organisms, excluding the highly inbred species Dipodomys ordii.

Table 6.8

Analysis of gene diversity in three American Indian tribes. From Roychoudhury
(unpublished).

Tribe               No. of           No. of   $H_S$   $H_C$   $D_{CS}$           $G_{CS}$
                    subpopulations   loci*
Papago              10               13       0.301   0.294   0.007      0.008   0.023
Makiritare           7               13       0.332   0.316   0.015      0.018   0.045
Yanomama            37               13       0.243   0.225   0.018      0.019   0.074
Mean (unweighted)                             0.292   0.278   0.013      0.015

* The loci used are those for blood groups ABO, MN, Ss, Rh(C), Rh(D), Rh(E), P, Jk,
Fy, Di, and K and proteins haptoglobin and group specific component.
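The partition quoted above can be retraced numerically. The following is a minimal sketch, assuming the additive relation $H_T = H_C + D_{CS} + D_{ST}$ implied by the subtraction in the text; the function and key names are illustrative only. With the rounded inputs the difference comes out as 0.025 rather than the 0.024 quoted, which is presumably a rounding effect.

```python
# Hierarchical partition of gene diversity, assuming H_T = H_C + D_CS + D_ST.
# Input values are the rounded estimates quoted in the text.

def partition_gene_diversity(h_total, h_within_village, d_villages_within_tribe):
    """Return D_ST and the share of H_T found at each level of the hierarchy."""
    d_between_tribes = h_total - h_within_village - d_villages_within_tribe
    return {
        "D_ST": d_between_tribes,
        "within_villages": h_within_village / h_total,
        "between_villages_within_tribes": d_villages_within_tribe / h_total,
        "between_tribes": d_between_tribes / h_total,
    }

shares = partition_gene_diversity(0.316, 0.278, 0.013)
for name, value in shares.items():
    print(f"{name}: {value:.3f}")
```

The three shares necessarily sum to one, so the percentages in the text (about 88, 4, and 8) are simply the three ratios to $H_T$.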

6.5  Mechanisms  of  maintenance  of  protein  polymorphisms 

In  the  foregoing  sections  we  have  seen  that  natural  populations  contain  a 
large  amount  of  genic  variability  which  can  be  revealed  only  by  genetic  and 
biochemical techniques. How this high degree of genic variation is maintained
in populations is one of the central problems in population genetics at present.
As noted earlier, there are two types of polymorphism, stable and transient.
Stable polymorphisms are maintained by balancing selection as discussed in
ch. 4, and theoretically they will persist in the population indefinitely unless
the selective forces change. On the other hand, transient polymorphisms
can be divided into two classes, i.e., selective and nonselective or neutral.
The former occur in the process of gene substitution by natural selection,
while the latter occur when neutral mutations increase in frequency by
random genetic drift. In practice, of course, the above distinctions are not
always easy to make and, as we have seen, genetic drift often dictates gene
frequency changes in small populations even if selection is fairly strong. For this
reason, Kimura (1968b) has defined the neutrality of a gene in relation to
population size. According to him, a gene is called neutral if the selection
coefficient for heterozygotes or homozygotes for the gene is much less than
$1/N_e$, where $N_e$ is the effective population size.

Another difficulty is that if we study a large number of loci, there must
always be at least some neutral or some selective genes segregating in a
population. Thus, it would be foolish to ask an all or none question about the
mechanism  of  maintenance  of  polymorphism.  The  question  generally  asked 
is  therefore  whether  the  majority  of  polymorphisms  in  a population  are 
stable  or  not  (or  neutral  or  not).  Ultimately,  this  question  should  be  answered 
in  terms  of  proportions  but  at  the  present  time  it  is  almost  impossible  to 
know  the  proportions  of  different  kinds  of  polymorphisms  in  natural 
populations. 

6.5.1  Overdominance  hypothesis

Overdominance  is  one  of  the  simplest  hypotheses  by  which  the  stable  genetic 
polymorphism  in  large  populations  can  be  explained.  Until  recently,  many 
polymorphisms  identified  at  the  morphological  level  were  thought  to  be 
stable  and  maintained  by  the  overdominant  effect  of  the  gene  concerned 
(Ford,  1964).  This  was  partly  due  to  the  brilliant  demonstration  by  Allison 
(1955)  and  his  group  that  the  sickle-cell  gene  polymorphism  in  the  Negroid 
populations  of  Africa  is  caused  by  a stronger  resistance  of  mutant  hetero- 
zygotes to  malaria  than  normal  homozygotes  while  the  mutant  homozygotes 
have  a low  fitness  because  of  the  sickle  cell  anemia.  It  was  therefore  natural 
that  when  Lewontin  and  Hubby  (1966)  discovered  a large  amount  of  genetic 
variability  at  the  protein  level,  they  first  tested  the  possibility  of  overdominant 
selection.  Their  data  suggested  that  about  30  percent  of  loci  of  the  genome  of 
Drosophila  pseudoobscura  are  polymorphic.  This  corresponds  to  1000  poly- 
morphic loci,  if  this  organism  has  3000  structural  genes.  The  theory  of 
genetic  load  by  Morton  et  al.  (1956)  and  Crow  (1958)  indicates  that  the 
maintenance  of  1000  overdominant  loci  incurs  a large  amount  of  genetic 
load  and  each  individual  must  have  an  extremely  high  fertility. 

Let us consider this problem in some detail. Denote the fitnesses of genotypes
$A_1A_1$, $A_1A_2$, and $A_2A_2$ at a locus by $1 - s_1$, 1, and $1 - s_2$, respectively.
The equilibrium gene frequency of $A_1$ is then given by $x_1 = s_2/(s_1 + s_2)$
(ch. 4), and the mean fitness at equilibrium is

$$\bar{W} = x_1^2(1 - s_1) + 2x_1(1 - x_1) + (1 - x_1)^2(1 - s_2) = 1 - \frac{s_1 s_2}{s_1 + s_2}.$$

Therefore, the mean fitness is lower by

$$L = \frac{s_1 s_2}{s_1 + s_2} \qquad (6.15)$$

compared  with  that  of  a hypothetical  population  of  heterozygotes  only.  This 
reduction  in  mean  fitness  is  called  genetic  load.  This  means  that  in  order 
to  maintain  the  polymorphism  without  reducing  population  size  the  popula- 
tion must  have  a fertility  excess  enough  to  offset  this  genetic  load.  Namely, 
the average fertility of an individual must be $1 + L$ or larger. For example,
if $s_1 = s_2 = 0.1$, then $L = 0.05$. If 1000 loci have this magnitude of load on
the average and gene action is independent, the total genetic load is 50. To
maintain a constant size, therefore, the population must have a fertility of
at least $(1 + 0.05)^{1000} \approx e^{50} = 5 \times 10^{20}$ offspring per individual, which is
certainly  much  higher  than  the  actual  fertility  of  Drosophila  in  nature. 
Because  of  this  extremely  high  fertility  excess  required,  Lewontin  and 
Hubby  (1966)  rejected  the  overdominance  hypothesis.  In  this  paper  they 
also  examined  other  possible  mechanisms  but  could  not  reach  any  definite 
conclusion. 
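The load argument can be retraced with a few lines of arithmetic. This is a minimal sketch, assuming the per-locus load $L = s_1 s_2/(s_1 + s_2)$ of (6.15) and independent (multiplicative) gene action; the function names are illustrative. Evaluated directly, $(1.05)^{1000}$ and $e^{50}$ both come out on the order of $10^{21}$, so the conclusion of an impossibly large fertility excess does not depend on the exact figure.

```python
import math

def overdominance_load(s1, s2):
    """Per-locus segregation load at equilibrium: L = s1*s2 / (s1 + s2)."""
    return s1 * s2 / (s1 + s2)

def required_fertility(s1, s2, n_loci):
    """Fertility needed when n_loci independent loci each carry load L."""
    return (1.0 + overdominance_load(s1, s2)) ** n_loci

load = overdominance_load(0.1, 0.1)          # per-locus load for s1 = s2 = 0.1
fertility = required_fertility(0.1, 0.1, 1000)
approximation = math.exp(1000 * load)        # e raised to the total load

print(f"L per locus        : {load:.3f}")
print(f"(1 + L)^1000       : {fertility:.3e}")
print(f"exp(total load)    : {approximation:.3e}")
```

Because the requirement grows exponentially with the number of loci, even a modest per-locus load becomes prohibitive once the number of independently acting loci reaches the hundreds.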

In the above computation we used the model of constant fitness, but the
same result can be obtained with the model of competitive selection. In ch. 4
we have shown that the fitnesses of $A_1A_1$, $A_1A_2$, and $A_2A_2$ under
competitive selection are given by $W_{11} = 1 + 2x_1x_2s_1 + x_2^2s_2$,
$W_{12} = 1 - x_1^2s_1 + x_2^2s_3$, and $W_{22} = 1 - x_1^2s_2 - 2x_1x_2s_3$, respectively,
where $x_2 = 1 - x_1$. In the case of overdominance we put $s_1 = -s_1'$ and
$s_2 = -s_1' + s_3$, where $s_1'$ and $s_3$ are positive. Therefore, we have
$W_{11} = 1 - x_2(1 + x_1)s_1' + x_2^2s_3$, $W_{12} = 1 + x_1^2s_1' + x_2^2s_3$, and
$W_{22} = 1 + x_1^2s_1' - x_1(1 + x_2)s_3$. Since the
equilibrium gene frequency of $A_1$ is $x_1 = s_3/(s_1' + s_3)$ from (4.60), the
minimum fertility required for maintaining this polymorphism is
$1 + x_1^2s_1' + x_2^2s_3 = 1 + s_1's_3/(s_1' + s_3)$. This indicates that the fertility
excess required for maintaining overdominant polymorphism is the same whether
it is due to competitive selection or noncompetitive selection.
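As a numerical check on the competitive-selection argument, the sketch below assumes that the overdominant fitnesses take the form $W_{11} = 1 - x_2(1 + x_1)s_1' + x_2^2 s_3$, $W_{12} = 1 + x_1^2 s_1' + x_2^2 s_3$, and $W_{22} = 1 + x_1^2 s_1' - x_1(1 + x_2)s_3$; this form and the names `s1p` and `s3` are assumptions made for illustration. Under that assumption the mean fitness is exactly 1 and the heterozygote fitness at equilibrium equals $1 + s_1's_3/(s_1' + s_3)$.

```python
def competitive_fitnesses(x1, s1p, s3):
    """Genotype fitnesses under the assumed competitive overdominance model.

    x1  : frequency of allele A1
    s1p : advantage of A1A2 over A1A1 (assumed positive)
    s3  : advantage of A1A2 over A2A2 (assumed positive)
    """
    x2 = 1.0 - x1
    w11 = 1.0 - x2 * (1.0 + x1) * s1p + x2**2 * s3
    w12 = 1.0 + x1**2 * s1p + x2**2 * s3
    w22 = 1.0 + x1**2 * s1p - x1 * (1.0 + x2) * s3
    return w11, w12, w22

s1p, s3 = 0.1, 0.1
x1 = s3 / (s1p + s3)                       # equilibrium frequency of A1
w11, w12, w22 = competitive_fitnesses(x1, s1p, s3)
mean_w = x1**2 * w11 + 2 * x1 * (1 - x1) * w12 + (1 - x1)**2 * w22

print(f"W11, W12, W22 = {w11:.3f}, {w12:.3f}, {w22:.3f}")
print(f"mean fitness  = {mean_w:.3f}")
print(f"min fertility = {w12:.3f}")
```

The heterozygote is the fittest genotype, so its fitness sets the minimum fertility; for $s_1' = s_3 = 0.1$ this exceeds 1 by the same 0.05 obtained as the load under constant fitness.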

Soon  after  Lewontin  and  Hubby's  paper  appeared,  Sved  et  al.  (1967), 
King  (1967),  and  Milkman  (1967)  published  models  of  truncation  selection 
with  overdominance,  in  which  a relatively  small  amount  of  fertility  excess 
is  required  for  maintaining  a large  number  of  polymorphic  loci.  The  models 
of selection published by these authors are more or less the same: selection
occurs according to a certain underlying scale, the value of which is determined
by the total number of heterozygous loci per individual and some
environmental  effects.  Individuals  whose  value  in  this  scale  is  larger  than  a 
certain  threshold  level  are  saved  to  form  the  adults  for  the  next  generation. 
Clearly,  this  is  a direct  application  of  the  theory  of  artificial  selection  for  a 
quantitative  character.  As  discussed  earlier,  however,  it  is  quite  unlikely 
that  natural  selection  occurs  according  to  this  scheme. 

In  finite  populations,  however,  the  fertility  excess  required  for  main- 
taining a given  number  of  overdominant  loci  (r)  under  competitive  selection 
seems  to  be  somewhat  smaller  than  that  required  for  an  infinitely  large 
population,  even  if  each  gene  acts  independently  (Kimura  and  Ohta,  1971b). 
This  is  because  in  a relatively  small  population  the  individuals  whose 
number  of  heterozygous  loci  is  very  large  would  never  appear  if  r is  large. 
For example, if $r = 40$ and the probability of heterozygosity at a locus is
1/2, then the probability of getting an individual heterozygous for all loci
is $2^{-40}$ or one in a trillion. In a finite population, therefore, such extreme
individuals can be disregarded. Kimura and Ohta (1971b) computed the
fitness required for 'the most probable extreme individual' in a population
of size $N$. They show that if $s_1 = s_2 = 0.1$, $r = 1000$, and $N = 25,000$,
then the population must have a fertility of 898 offspring per individual to
maintain a constant size. This value is much smaller than $5 \times 10^{20}$ obtained
earlier, though it is still much higher than the actual fertility in most
mammalian species. On the other hand, if $s_1 = s_2 = 0.01$, $r$ and $N$ remaining
the same, the average fertility required becomes 1.97 offspring per individual.
This suggests that if the selection coefficient is small a large number of loci may
be maintained by overdominant selection. Kimura and Ohta's computation
is,  however,  somewhat  questionable,  since  they  compute  the  fitness  of  the 
most  probable  extreme  individual  after  deriving  the  variance  of  fitness  using 
the  model  of  unlimited  fertility,  as  in  the  case  of  substitutional  load.  Further- 
more, for  noncompetitive  selection  the  stochastic  elements  in  finite  popu- 
lations are  known  to  increase  the  genetic  load  due  to  overdominance 
(Kimura  and  Crow,  1964;  Nei  and  Imaizumi,  1966b;  Robertson,  1970). 
At  any  rate  it  seems  difficult  at  the  present  time  to  exclude  the  possibility 
of  overdominant  polymorphism  on  the  basis  of  the  load  argument  alone. 
Of  course,  this  is  not  proof  of  the  existence  of  overdominant  polymorphism 
either. 
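The rarity of the extreme individual in the example above is easy to verify. The sketch below is only an illustration of that one step, assuming independent loci each heterozygous with probability 1/2; pairing $r = 40$ with a population of 25,000 is a hypothetical combination chosen for the illustration, and the function names are not from the original.

```python
def prob_all_heterozygous(r, h=0.5):
    """Probability that one individual is heterozygous at all r independent loci."""
    return h ** r

def expected_extreme_individuals(n_individuals, r, h=0.5):
    """Expected number of fully heterozygous individuals in a population."""
    return n_individuals * prob_all_heterozygous(r, h)

p_extreme = prob_all_heterozygous(40)
expected = expected_extreme_individuals(25_000, 40)

print(f"P(heterozygous at all 40 loci) = {p_extreme:.3e}")
print(f"expected number among 25,000   = {expected:.3e}")
```

An expected count far below one is the sense in which such genotypes "would never appear" and can be left out of the fertility requirement.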

I have  already  mentioned  that  some  hemoglobin  and  G6PD  mutations 
in  man  are  apparently  maintained  by  overdominance.  There  are  several 
other cases in which the maintenance of a particular protein or enzyme
polymorphism has been ascribed to overdominance. One such case is that
of an esterase locus in the freshwater fish Catostomus clarkii in the Colorado
River system. Koehn and Rasmussen (1967) and Koehn (1969) showed that
the frequency of the Es-I^a allele decreases with increasing latitude, while
that of the Es-I^b gene increases. Thus, the frequencies of the Es-I^a allele in
southern Arizona, central Arizona, southwestern Utah, and northern Nevada
are 1.00, 0.84 ~ 0.92, 0.46 ~ 0.60, and 0.17, respectively. Koehn (1969)
showed that this cline is correlated with temperature-dependent Es-I
activity of the three possible genotypes. At high temperatures (20°-40°C)
Es-I enzymes from genotype Es-I^a/Es-I^a have a higher activity than those
from genotype Es-I^b/Es-I^b, while at low temperatures (0°-20°C) the enzymes
from the latter genotype have a higher activity than those from the former.
On  the  other  hand,  the  enzymes  from  heterozygotes  have  a higher  activity 
than  those  from  both  homozygotes  at  intermediate  temperatures.  Thus,  as 
Koehn  assumed,  it  is  probable  that  these  differences  in  enzyme  activity  are 
responsible  for  the  maintenance  of  the  polymorphic  cline.  To  prove  this 
hypothesis,  however,  it  is  necessary  to  examine  the  fitnesses  of  the  three 
genotypes  directly. 

Heterozygote  advantage  has  also  been  reported  by  Frelinger  (1972)  in 
pigeon  transferrins.  He  showed  that  the  eggs  laid  by  heterozygotes  for  this 
locus  show  a significantly  higher  hatchability  than  those  laid  by  homo- 
zygotes  and  this  is  apparently  due  to  the  fact  that  ovotransferrins  from 
heterozygous  females  inhibit  microbial  growth  better  than  those  from 
homozygotes. 

Schaal  (1974)  studied  the  heterozygote  frequencies  in  different  age  groups 
of  Liatris  cylindracea,  and  showed  that  they  increase  with  increasing  age  at 
many  enzyme  loci.  A similar  result  has  been  obtained  by  Fujino  and  Kang 
(1968)  at  the  transferrin  locus  in  the  skipjack  tuna  and  by  Tinkle  and  Selan- 
der  (1973)  at  the  esterase-1  locus  in  the  sagebrush  lizard.  These  data  suggest 
that  there  is  heterozygote  advantage,  though  it  is  not  clear  whether  it  is  due 
to  true  overdominance  or  associative  overdominance. 

Marshall  and  Allard  (1970b)  reported  apparent  overdominance  in  enzyme 
polymorphism of the wild oat Avena barbata. This plant reproduces mainly
by  self-fertilization  but  outcrosses  with  a frequency  of  a few  percent.  In 
one  population  the  average  rate  of  outcrossing  was  estimated  to  be  t = 
0.014. If there is no selection, the genotype frequencies for a pair of alleles,
$A_1$ and $A_2$, at equilibrium are given by

Genotype     Frequency
$A_1A_1$     $(1 - F)x^2 + Fx$
$A_1A_2$     $2(1 - F)x(1 - x)$
$A_2A_2$     $(1 - F)(1 - x)^2 + F(1 - x)$

where $x$ is the frequency of allele $A_1$ and $F$ is the inbreeding coefficient at
equilibrium and given by $(1 - t)/(1 + t)$. Thus, with $t = 0.014$ we expect
that $F = 0.97$. On the other hand, the observed values of $F$ for the four
enzyme  loci  examined  (E4,  E11S,  P,,  APX,)  were  0.70  ~ 0.78.  This  indicates 
that  the  observed  frequency  of  heterozygotes  is  considerably  higher  than  the 
expected.  Marshall  and  Allard  ascribed  this  difference  to  overdominance  and 
predicted  that  the  fitness  of  homozygotes  is  about  one  half  of  the  hetero- 
zygote fitness.  Again,  however,  this  prediction  has  not  been  tested  experi- 
mentally. Furthermore,  as  indicated  by  S.  K.  Jain  (personal  communication), 
if  outcrossing  rate  fluctuates  from  year  to  year,  the  expected  frequency  of 
heterozygotes  can  be  much  higher  than  that  given  by  the  above  formula, 
even if the mean value of $t$ remains the same. This is because random mating
restores the Hardy-Weinberg equilibrium in one generation, while selfing
reduces the frequency of heterozygotes only by a half in each generation.
In fact, the $t$ value in this organism seems to vary considerably with
environmental condition (see Marshall and Allard, 1970b). It should also be
noted  that  in  selfing  organisms  associative  overdominance  due  to  linked 
detrimental  genes  may  be  developed  (Ohta,  1971;  Ohta  and  Cockerham, 
1974). 
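The expectation for a partially selfing population can be checked with a short computation. The sketch below assumes the mixed-mating model described in the text, in which a fraction $t$ of seed comes from random outcrossing and the remainder from selfing; the allele frequency $x = 0.5$ is a hypothetical value, and the function names are illustrative. Iterating the heterozygote frequency converges to the value given by $F = (1 - t)/(1 + t)$.

```python
def equilibrium_inbreeding(t):
    """Equilibrium inbreeding coefficient for outcrossing rate t."""
    return (1.0 - t) / (1.0 + t)

def genotype_frequencies(x, f):
    """Frequencies of A1A1, A1A2, A2A2 for allele frequency x and inbreeding f."""
    return (
        (1.0 - f) * x**2 + f * x,
        2.0 * (1.0 - f) * x * (1.0 - x),
        (1.0 - f) * (1.0 - x)**2 + f * (1.0 - x),
    )

def iterate_heterozygosity(x, t, generations, h_start):
    """Outcrossed seed is in Hardy-Weinberg proportions; selfed seed carries
    half the heterozygosity of the parental generation."""
    h = h_start
    for _ in range(generations):
        h = t * 2.0 * x * (1.0 - x) + (1.0 - t) * h / 2.0
    return h

t = 0.014                       # outcrossing rate quoted in the text
x = 0.5                         # hypothetical allele frequency
f_expected = equilibrium_inbreeding(t)
het_expected = genotype_frequencies(x, f_expected)[1]
het_at_f_070 = genotype_frequencies(x, 0.70)[1]   # lower end of the observed F
het_iterated = iterate_heterozygosity(x, t, 200, h_start=0.5)

print(f"expected F            : {f_expected:.3f}")
print(f"heterozygotes, F=0.97 : {het_expected:.4f}")
print(f"heterozygotes, F=0.70 : {het_at_f_070:.4f}")
print(f"iterated equilibrium  : {het_iterated:.4f}")
```

At this hypothetical allele frequency an observed $F$ of 0.70 corresponds to roughly ten times as many heterozygotes as the neutral expectation, which is the size of the excess that Marshall and Allard ascribed to overdominance.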

Several  authors  (e.g.  Prakash  et  al.,  1969;  Ayala  et  al.,  1971)  studied  the 
gene  frequencies  at  protein  loci  in  various  locations  in  the  territory  of  an 
organism  and  found  that  the  gene  frequency  of  a particular  allele  is  often 
very  similar  even  for  distantly  located  populations.  These  data  were  first 
interpreted  as  evidence  for  overdominant  or  some  other  form  of  stabilizing 
selection,  since  if  there  is  little  migration  between  populations  and  if  there  is 
no  selection,  the  gene  frequency  in  a population  would  be  affected  by  random 
genetic  drift  and  vary  from  location  to  location.  However,  Maruyama 
(1970b, c) and Kimura and Maruyama (1971) showed that even if there is no
selection,  the  differentiation  of  gene  frequency  among  local  populations  is 
very  small  if  individuals  are  distributed  two-dimensionally  and  migration 
between  adjacent  populations  is  sufficiently  large,  so  that  Nm  > 1,  where  N 
is  the  number  of  individuals  per  population  and  m is  the  migration  rate 
between  two  adjacent  populations  per  generation.  Since  the  condition  Nm  > 
1 would  be  satisfied  in  many  organisms,  similarity  of  gene  frequencies  in 
distant populations is not necessarily evidence for overdominant selection.

In a computer simulation Franklin and Lewontin (1970) discovered that
if  many  overdominant  loci  with  multiplicative  fitnesses  are  closely  linked 
on  a chromosome,  they  produce  strong  linkage  disequilibria  among  loci 
and  often  only  two  types  of  chromosomes  with  complementary  gene  ar- 
rangements are  formed.  Slatkin  (1972)  provided  a theoretical  background 
for  this  finding.  Similar  strong  linkage  disequilibria  were  also  observed  in 
Wills  et  al.’s  (1970)  computer  simulation  of  truncation  selection  with  over- 
dominance. Observation  of  gamete  frequencies  by  Mukai  et  al.  (1971,  1974) 
and  Langley  et  al.  (1974),  however,  showed  that  the  enzyme  or  protein  loci 
show  little  linkage  disequilibria  in  natural  populations.  Charlesworth  and 
Charlesworth  (1973)  claimed  that  the  linkage  disequilibria  they  found  in 
four cases out of the thirty examined are due to selection. However, their
data  can  easily  be  explained  by  genetic  drift  and  migration  before  sampling. 
It  should  be  noted  that  there  are  many  ways  in  which  linkage  disequilibria 
are  generated  without  selection  (Hill  and  Robertson,  1968;  Karlin  and 
McGregor,  1968;  Sved,  1968b;  Ohta  and  Kimura,  1969;  Cavalli-Sforza  and 
Bodmer,  1971;  Prout,  1973;  Nei  and  Li,  1973).  At  any  rate,  random  mating 
populations  generally  do  not  have  the  strong  linkage  disequilibria  predicted 
by  Franklin  and  Lewontin. 

In  self-fertilizing  or  asexual  organisms,  however,  most  loci  are  expected 
to  be  in  linkage  disequilibrium,  since  in  these  organisms  the  unit  of  inheritance 
is  not  the  gene  but  the  genotype,  as  mentioned  earlier.  In  fact,  Clegg  et  al. 
(1972),  Allard  et  al.  (1972),  and  Hamrick  and  Allard  (1972)  found  strong 
linkage  disequilibria  in  predominantly  selfing  plants,  barley  and  wild  oats 
(Avena  barbata).  They  regarded  these  linkage  disequilibria  as  evidence  for 
coadaptation  of  genes.  However,  if  we  note  that  their  populations  are 
essentially  a collection  of  pure  lines  disregarding  the  effect  of  a small 
proportion  of  outcrossing,  their  data  can  also  be  explained  by  the  bottleneck 
effect  and  random  genetic  drift  that  occur  at  the  genotypic  level  when  seeds 
are  sampled  for  the  next  generation. 

Prakash  and  Lewontin  (1968)  found  strong  associations  between  inversion 
chromosomes  and  alleles  at  the  Pt-10  and  amylase  loci  in  chromosome  III 
of  Drosophila  pseudoobscura  and  D.  persimilis.  For  example,  gene  arrange- 
ment ST,  which  is  shared  by  both  species,  always  carries  allele  1.04  at  the 
Pt-10  locus,  while  gene  arrangement  SC  in  D.  pseudoobscura  mostly  carries 
allele  1.06.  They  claimed  that  these  strong  associations  are  evidence  for  the 
coadaptation  of  genes  in  inversion  chromosomes,  since  the  divergence  time 
between  D.  pseudoobscura  and  D.  persimilis  is  possibly  3 ~ 5 million  years. 
In my opinion this claim is not warranted. Since there is no (or virtually
no) recombination between different gene arrangements in these species,
the monomorphism of the Pt-10 locus in the ST gene arrangement can also
be explained without selection if we assume that no mutant gene has spread
through the gene pool of ST chromosomes after the two species diverged.
If the rate of gene substitution is $10^{-7}$ per locus per year for neutral genes,
then the probability that no gene substitution has occurred in the Pt-10
locus of ST chromosomes in both species for the last 5 million years is
$e^{-2 \times 10^{-7} \times 5 \times 10^{6}} = 0.37$ approximately. That is, even if the divergence time
of  5 million  years  is  correct,  the  probability  is  quite  high.  Actually,  the  diver- 
gence time  between  the  two  species  seems  to  be  much  smaller  than  5 million 
years,  since,  as  will  be  seen  later  (section  7.3),  the  genetic  distance  between 
these  species  is  only  0.05.  This  may  correspond  to  a divergence  time  of  only 
about 250,000 years. If this is correct, the possibility of 'neutral
monomorphism' is very high even if there is some amount of double crossing over
between inversion chromosomes. Similar but less strong associations between
inversion chromosomes and isozyme alleles have also been reported by
Kojima et al. (1970) and Nair and Brncic (1971), but again they can be
explained  either  by  coadaptation  or  by  phylogenetic  resemblance.  It  is 
instructive  to  note  the  fact  that  the  amino  acid  sequences  of  the  human  and 
chimpanzee  hemoglobins  are  identical. 
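The probability of 'neutral monomorphism' used in this argument can be evaluated directly. This is a minimal sketch, assuming that substitutions along each lineage follow a Poisson process with the stated rate, so that the chance of no substitution in two lineages over $T$ years is $e^{-2 \lambda T}$; the function and parameter names are illustrative.

```python
import math

def prob_no_substitution(rate_per_year, years, n_lineages=2):
    """Probability that no substitution occurs in any lineage, assuming
    substitutions arrive as a Poisson process at rate_per_year per locus."""
    return math.exp(-n_lineages * rate_per_year * years)

p_5_million = prob_no_substitution(1e-7, 5e6)      # divergence of 5 million years
p_250_thousand = prob_no_substitution(1e-7, 2.5e5) # divergence of 250,000 years

print(f"T = 5,000,000 years : {p_5_million:.3f}")
print(f"T =   250,000 years : {p_250_thousand:.3f}")
```

With the shorter divergence time suggested by the genetic distance, the probability rises to about 0.95, so the shared allele is close to what would be expected without any selection.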

It  is  now  clear  that  proof  of  overdominant  selection  or  any  other  type  of 
selection  by  means  of  gene  frequency  data  is  very  difficult.  One  might  think 
that  this  problem  can  be  solved  by  examining  genotype  fitnesses  directly. 
The  fitness  of  a genotype  can  be  measured  by  counting  the  total  number  of 
offspring  reaching  maturity.  In  practice,  however,  the  fitness  differences 
between genotypes are generally so small that an enormous number of
offspring  must  be  examined  to  detect  the  small  differences.  Yet  a small 
difference  in  fitness  is  very  important  in  the  population  dynamics  of  mutant 
genes. 

Genotype  fitness  can  also  be  measured  by  examining  the  long-term  changes 
of  gene  frequency  in  artificial  populations.  This  has  already  been  done  by 
several  authors  in  Drosophila.  The  results  obtained  are,  however,  quite 
inconsistent.  MacIntyre  and  Wright  (1966)  studied  the  change  in  frequency 
of  the  F allele  at  the  esterase  6 locus  in  cage  populations  of  D.  melanogaster, 
but  the  pattern  of  the  change  varied  considerably  with  genetic  background 
and  replication,  and  no  definite  conclusion  was  obtained  about  the  type  of 
selection.  Yarbrough  and  Kojima  (1967)  showed  that  the  frequency  of  the 
same  F allele  in  the  same  organism  reaches  an  apparent  stable  equilibrium 
in about 30 generations. The pattern of the gene frequency change was
different from what was expected under overdominant selection but appeared
to  be  in  good  agreement  with  the  change  due  to  gene  frequency  dependent 
selection  (see  the  next  section).  Nevertheless,  there  was  a considerable 
variation  in  the  pattern  of  gene  frequency  among  replicate  cage  populations. 
In  the  experiment  by  Yamazaki  (1971)  with  an  allele  of  the  esterase  5 locus 
in  D.  pseudoobscura,  there  occurred  no  significant  changes  in  gene  frequency 
in  12  replicate  populations.  In  still  another  experiment  by  Ayala  (1972) 
in D. willistoni gene frequencies converged to a supposed equilibrium value
at two loci but little change in gene frequency occurred at one locus.

One  serious  problem  in  this  type  of  experiment  is  that  any  gene  in  a 
genome  exists  linked  with  other  genes  and  the  effect  of  a gene  can  almost 
never  be  completely  isolated  from  those  of  others.  Particularly,  if  we  start 
cage  populations  from  a small  number  of  genomes  extracted  from  a natural 
population,  the  marker  gene  is  expected  to  show  seemingly  overdominant 
effects  in  early  generations  because  of  the  associative  overdominance  discussed 
earlier. 

To understand the overdominant effect of enzyme variation on fitness, it
seems  to  be  important  to  know  the  biochemical  function  of  the  enzyme  in 
question  at  the  molecular  level.  If  we  know  this  function,  we  will  be  able 
to  study  the  effect  of  the  allelic  interaction  in  heterozygotes  on  fitness 
through  biochemical  pathways.  Recently,  Fincham  (1972)  proposed  the 
hypothesis  that  there  is  an  optimum  condition  for  an  enzyme  activity  with 
respect  to  allosteric  effectors,  and  a mutation  which  increases  the  enzyme 
activity  in  heterozygous  condition  may  overshoot  the  optimum  in  homo- 
zygous condition.  Latter  (1973b)  proposed  a more  general  optimum-model 
selection  in  which  various  biochemical  mutants  are  graded  on  a single  scale 
of  enzyme  activity  and  natural  selection  occurs  for  an  optimal  enzyme 
activity.  Using  this  model,  he  studied  the  expected  heterozygosity  in  a finite 
population  when  the  effects  of  mutation,  selection,  and  random  genetic  drift 
are  balanced.  The  expected  heterozygosity  was  lower  than  that  for  the  case 
of  purely  neutral  mutations.  Thus,  this  type  of  selection  decreases  rather 
than  increases  the  amount  of  genetic  variability  at  equilibrium. 

6.5.2  Other  types  of  balancing  selection 

At  the  esterase  6 locus  of  Drosophila  melanogaster  there  are  two  alleles, 
F and  S,  in  most  natural  populations.  In  studying  the  frequency  change  of 
the  F allele  in  cage  populations,  Yarbrough  and  Kojima  (1967)  noticed  that 
although the gene frequency converged to an equilibrium value, starting
from two different initial frequencies, the pattern of the change was
considerably different from what was expected under overdominant selection.
The result seemed to be best explained by gene frequency dependent selection.
A direct  test  of  genotype  fitnesses  by  counting  progeny  numbers  also  sug- 
gested frequency  dependent  selection  and  at  the  gene  frequency  close  to  the 
equilibrium  value  the  three  genotypes  FF,  FS,  and  SS  showed  almost  equal 
fitness  (Kojima  and  Yarbrough,  1967).  A similar  result  was  also  obtained 
with  the  alcohol  dehydrogenase  locus  (Kojima  and  Tobari,  1969).  From 
these  observations,  Kojima  postulated  that  the  majority  of  enzyme  poly- 
morphisms are  maintained  by  frequency  dependent  selection  and  at  equilib- 
rium the  polymorphisms  are  load-free  because  all  genotypes  have  virtually 
the  same  fitness. 

There  are,  however,  some  difficulties  in  this  hypothesis.  First,  the  mecha- 
nism of  frequency  dependent  selection  is  not  well  known,  though  there  is 
some  evidence  that  it  is  caused  by  different  micro-niches  or  resources  required 
by  different  genotypes  (Huang  et  al.,  1971).  Second,  MacIntyre  and  Wright 
(1966),  using  the  same  pair  of  alleles  at  the  same  locus,  obtained  quite  a 
different  pattern  of  gene  frequency  change,  as  mentioned  earlier.  This 
suggests  that  the  gene  frequency  change  at  this  locus  is  very  sensitive  to 
environmental  conditions  and  genetic  backgrounds.  If  this  is  so,  it  is 
questionable to assume that the results obtained in laboratory experiments
directly apply to natural populations. Third, contrary to Kojima's
belief,  the  polymorphism  due  to  frequency  dependent  selection  is  not  load- 
free,  as  was  shown  by  Kimura  and  Ohta  (1971b).  In  other  words,  there  must 
be  fertility  excess  for  frequency  dependent  selection  to  operate.  If  there  is 
no  fertility  excess,  selection  cannot  operate  when  gene  frequency  deviates 
from  the  equilibrium  value.  For  example,  in  the  model  of  frequency  de- 
pendent selection  discussed  in  ch.  4,  the  absolute  fertility  must  be  equal  to 
or  higher  than  1 + a or  1 - a + b,  whichever  is  higher.  Otherwise,  the 
required  frequency  dependent  selection  cannot  occur.  In  the  case  of  inversion 
polymorphism  discussed  earlier,  a and  b were  estimated  to  be  0.902  and 
1.288.  Therefore,  the  fertility  must  be  at  least  1.902  just  to  maintain  this 
polymorphism,  though  the  fitnesses  of  the  three  genotypes  at  equilibrium 
are  all  equal  to  1.  It  is  clear  that  if  there  are  a large  number  of  such  loci 
independently  functioning,  the  fertility  excess  required  is  tremendously  high. 
By  using  a somewhat  different  argument,  Kimura  and  Ohta  (1971b)  have 
provided  a method  to  compute  the  fertility  excess  required  in  this  case. 
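The fertility requirement for frequency dependent selection can be illustrated with the figures quoted above. The sketch takes the bound from the text, namely that fertility must be at least the larger of $1 + a$ and $1 - a + b$. Extending this to several loci by multiplying the single-locus requirement is an assumption made here by analogy with the overdominance computation, not Kimura and Ohta's method, and the function names are illustrative.

```python
def min_fertility_single_locus(a, b):
    """Minimum absolute fertility for the frequency dependent model:
    the larger of 1 + a and 1 - a + b."""
    return max(1.0 + a, 1.0 - a + b)

def min_fertility_independent_loci(a, b, n_loci):
    """Assumed multiplicative requirement for n_loci independent loci."""
    return min_fertility_single_locus(a, b) ** n_loci

single = min_fertility_single_locus(0.902, 1.288)   # inversion polymorphism values
ten_loci = min_fertility_independent_loci(0.902, 1.288, 10)

print(f"one locus : {single:.3f}")
print(f"ten loci  : {ten_loci:.1f}")
```

Even though all genotypes have fitness 1 at equilibrium, the requirement under this assumption already runs to several hundred offspring per individual for ten such loci.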

In  addition  to  frequency  dependent  selection  there  are  several  other 
possible mechanisms by which the enzyme polymorphism can be maintained.
As mentioned earlier, heterogeneous environments may maintain stable
polymorphism  under  certain  conditions.  In  fact,  Selander  and  Kaufman 
(1973a)  tried  to  explain  the  difference  in  average  heterozygosity  between 
vertebrates  and  invertebrates  (table  6.2)  by  assuming  that  the  environments 
for  invertebrates  are  generally  more  heterogeneous  than  those  for  verte- 
brates. Unfortunately,  however,  there  is  no  evidence  for  this  assumption. 
Furthermore, this type of selection does not have much power to hold
polymorphism, as discussed earlier.

Several  authors  (e.g.  Kojima  et  al.,  1972;  Johnson  and  Schaffer,  1973) 
found  correlations  between  the  patterns  of  gene  frequency  variations  at 
enzyme  loci  and  ecological  or  environmental  factors  such  as  temperature, 
latitude,  and  altitude.  This  sort  of  correlation,  however,  can  always  be  ex- 
plained either  by  selection  or  by  neutral  mutations.  In  the  wild  oat  Avena 
barbata  Clegg  and  Allard  (1972)  and  Hamrick  and  Allard  (1972)  showed 
that  the  genotype  frequencies  for  certain  enzyme  loci  are  strongly  correlated 
with  the  humidity  of  the  environment.  Evidently,  certain  genotypes  are 
adapted  to  the  arid  environment,  while  others  are  adapted  to  the  humid 
environment.  However,  it  is  not  clear  whether  the  adaptation  is  due  to  the 
enzyme  loci  themselves  or  to  other  genes  associated  with  the  enzyme  poly- 
morphism, since  in  selfing  organisms  such  as  this  plant  strong  linkage  dis- 
equilibrium is  expected  to  occur. 

6.5.3  Neutral  mutations 

As  discussed  in  ch.  5,  a large  amount  of  genic  variation  may  be  maintained 
in  a population  without  any  selective  force  if  the  product  of  mutation  rate 
and effective population size is sufficiently large. In this case no mutant gene
stays in the population forever, but since new mutations are always
produced,  genic  variation  is  always  present.  Under  this  hypothesis,  therefore, 
gene  substitution  in  evolution  and  genetic  polymorphism  in  a population 
are  two  different  aspects  of  the  same  phenomenon,  as  emphasized  by  Kimura 
and  Ohta  (1971a).  To  my  knowledge,  this  hypothesis  was  first  put  forward  by 
Robertson  (1967)  and  Crow  (1968)  in  the  context  of  genetic  polymorphism 
and  more  forcefully  by  Kimura  (1968a)  in  the  context  of  gene  substitution. 
The  theoretical  basis  of  neutral  polymorphism  had  been,  however,  given  by 
Wright  (1931,  1932,  1948b)  and  Kimura  and  Crow  (1964),  who  studied  the 
gene  frequency  distribution  and  the  expected  number  of  neutral  alleles  per 
locus. Also, the possibility that a large fraction of mutant genes are selectively
neutral had been discussed by Sueoka (1962) and Freese (1962) from the
biochemical  point  of  view. 

We  have  already  discussed  the  mathematical  model  of  the  neutral  muta- 
tion hypothesis  in  ch.  5.  Let  us  summarize  the  essential  features  of  the  model 
with the aid of fig. 5.4. 1) In this model there occur on the average $2Nv$
mutations  at  a locus  every  generation  in  a population  of  size  N,  where  v is 
the  mutation  rate  at  a locus.  The  fate  of  each  mutant  allele  is  determined 
wholly  by  chance;  some  alleles  may  increase  in  frequency,  while  others 
may  be  eliminated  by  chance  from  the  population.  The  majority  of  the 
mutant  alleles  are  lost  in  early  generations,  and  only  one  out  of  2N  new 
mutant  alleles  will  eventually  be  fixed  in  the  population.  2)  The  time  required 
for  a successful  mutant  allele  to  be  fixed  is  4Ne  generations  on  the  average. 
Thus,  in  a large  population  gene  substitution  takes  a long  time,  during  which 
transient  polymorphism  necessarily  occurs.  For  example,  in  a population 
of  Ne  = 10,000,  the  fixation  time  is  40,000  generations,  which  will  be  800,000 
years  for  an  organism  with  a generation  time  of  20  years,  as  in  man.  This 
time  is  apparently  much  longer  than  the  time  required  for  racial  formation 
in  man.  3)  Transient  polymorphism  is  also  caused  by  unsuccessful  alleles 
which  reach  an  appreciable  gene  frequency  but  are  eventually  eliminated  by 
chance.  The  average  extinction  time  for  an  unsuccessful  mutant  allele  is 
generally  very  short.  At  the  steady  state  where  mutation  and  random  genetic 
drift  are  balanced,  the  expected  heterozygosity  or  gene  diversity  is  given  by 
$H = 4N_e v/(4N_e v + 1)$. 4) At the steady state the rate of gene substitution
is equal to the mutation rate ($2Nv \times (1/2N) = v$). 5) The definition of
neutrality  depends  on  whether  the  frequency  change  of  the  gene  in  question 
is  entirely  or  almost  entirely  determined  by  random  genetic  drift  or  not. 
Thus,  a mutant  gene  which  is  selective  in  a large  population  may  become 
neutral  in  a relatively  small  population,  as  mentioned  earlier.  Also,  in  the 
presence  of  random  fluctuation  of  selection  intensity  in  different  generations 
a selective  gene  may  behave  just  like  a neutral  gene.  6)  The  neutral  mutation 
hypothesis  proposed  by  Kimura  is  a majority  rule  and  does  not  deny  the 
existence  of  deleterious  genes  causing  a small  amount  of  genetic  variability 
and  a small  proportion  of  advantageous  or  overdominant  genes.  In  fact, 
if  we  consider  only  fresh  mutations,  a majority  of  them  appear  to  be  dele- 
terious (Kimura  and  Ohta,  1973b).  Because  of  their  deleterious  effects,  how- 
ever, they  are  quickly  eliminated  from  the  population  and  contribute  little 
to  the  genetic  variability. 
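The quantities listed in points 1) to 4) can be tabulated with a short calculation. The following is only a minimal sketch: the function name `neutral_model_summary`, the simplification $N = N_e$, and the value $v = 2 \times 10^{-6}$ are illustrative assumptions rather than part of the model itself.

```python
def neutral_model_summary(n_e, v, generation_years):
    """Summarize the neutral model, assuming actual size N equals N_e."""
    new_mutants = 2 * n_e * v          # new mutations per locus per generation
    p_fix = 1 / (2 * n_e)              # fixation probability of one new mutant
    fixation_generations = 4 * n_e     # mean fixation time of a successful mutant
    m = 4 * n_e * v
    return {
        "new_mutants_per_generation": new_mutants,
        "substitution_rate": new_mutants * p_fix,   # reduces to v
        "fixation_generations": fixation_generations,
        "fixation_years": fixation_generations * generation_years,
        "heterozygosity": m / (m + 1),
    }


# Illustrative values: N_e = 10,000 and a generation time of 20 years
summary = neutral_model_summary(n_e=10_000, v=2e-6, generation_years=20)
for name, value in summary.items():
    print(name, value)
```

With these inputs the fixation time comes out as 40,000 generations, or 800,000 years, as in the example of point 2), and the substitution rate equals the assumed mutation rate.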

Let  us  now  examine  the  above  hypothesis  by  using  the  available  data.  In 
this chapter, however, we shall consider only the problems related to genetic
polymorphism,  deferring  the  evolutionary  aspect  to  ch.  8. 

We  have  seen  that  the  average  heterozygosity  in  human  populations  is 
about 10 percent. Thus, if the neutral mutation hypothesis is correct, $4N_e v$
must be approximately 0.1. In ch. 3, we estimated the rate of electrophoretically
detectable mutations for protein loci under the hypothesis of neutral
mutation to be $10^{-7}$ per year. If the generation time in the past was 20
years, the mutation rate per generation becomes $2 \times 10^{-6}$. Therefore,
in order to get $4N_e v = 0.1$, $N_e$ must be about 13,000. The size of the present
human  population  is  much  larger  than  this  number,  but  the  effective  popula- 
tion size  in  the  early  process  of  human  evolution  might  have  been  quite  small. 
If  population  size  increases,  the  average  heterozygosity  is  expected  to  increase 
but  it  takes  a long  time  to  reach  the  new  steady  state  level  (ch.  5). 
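The arithmetic behind this estimate of $N_e$ can be retraced as follows. This is a sketch under the steady state formula $H = M/(1 + M)$ with $M = 4N_e v$; the helper name `effective_size_from_heterozygosity` is hypothetical.

```python
def effective_size_from_heterozygosity(h, v):
    """Solve H = M / (1 + M), with M = 4 * N_e * v, for N_e."""
    m = h / (1 - h)
    return m / (4 * v)


v_per_year = 1e-7                   # rate of detectable mutations per locus per year
generation_years = 20               # assumed past generation time in man
v_per_generation = v_per_year * generation_years

# Solving the steady state formula exactly for H = 0.10
n_e_exact = effective_size_from_heterozygosity(0.10, v_per_generation)
# Cruder route: set 4 * N_e * v equal to 0.1 directly
n_e_rough = 0.10 / (4 * v_per_generation)
print(round(n_e_exact), round(n_e_rough))
```

The two routes give values near 13,900 and 12,500, which bracket the figure of about 13,000 quoted in the text.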

There  is  reason  to  believe  that  the  above  estimate  of  Ne  is  an  under- 
estimate. In  the  above  procedure  we  have  implicitly  assumed  that  the 
mutation  rate  is  the  same  for  all  loci.  This  assumption  is  certainly  incorrect. 
When $M = 4Nv$ varies with locus, the expectation of homozygosity ($J$) is
given by

$$E(J) = \frac{1}{1 + \bar{M}} + \frac{\sigma_M^2}{(1 + \bar{M})^3}$$

approximately, where $\bar{M}$ and $\sigma_M^2$ are the mean and variance of $M$, respectively.
Therefore, the expected heterozygosity is

$$H = \frac{\bar{M}}{1 + \bar{M}} - \frac{\sigma_M^2}{(1 + \bar{M})^3}.$$

Namely, the average heterozygosity for a given $N_e$ is smaller when mutation
rate varies than when it is constant. Unfortunately, we do not know the
magnitude of $\sigma_M^2$ at the present time.
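The direction of this effect can be checked numerically. The sketch below assumes that the homozygosity at each locus is $1/(1 + M)$ and that the approximation referred to above is the second-order Taylor expansion around the mean of $M$; the per-locus values of $M$ are hypothetical.

```python
from statistics import mean, pvariance


def homozygosity_exact(m_values):
    """Average of 1 / (1 + M) over loci."""
    return mean(1 / (1 + m) for m in m_values)


def homozygosity_taylor(m_values):
    """Second-order approximation: 1/(1 + Mbar) + var(M)/(1 + Mbar)**3."""
    m_bar = mean(m_values)
    var_m = pvariance(m_values)
    return 1 / (1 + m_bar) + var_m / (1 + m_bar) ** 3


# Hypothetical values of M = 4 * N_e * v at seven loci (mean 0.11)
m_values = [0.02, 0.05, 0.08, 0.11, 0.14, 0.17, 0.20]

h_constant = mean(m_values) / (1 + mean(m_values))   # all loci share the mean M
h_varying = 1 - homozygosity_exact(m_values)         # M differs among loci
h_taylor = 1 - homozygosity_taylor(m_values)
print(h_constant, h_varying, h_taylor)
```

For any spread of $M$ around a fixed mean, `h_varying` falls below `h_constant`, so an $N_e$ computed by assuming a single mutation rate is too small.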

At  any  rate,  the  level  of  gene  diversity  in  human  populations  is  not  terribly 
inconsistent  with  the  neutral  mutation  hypothesis.  As  discussed  earlier,  the 
average  gene  diversity  varies  with  organism,  but  the  magnitude  of  variation 
can  be  explained  by  the  differences  in  effective  population  size  and  sampling 
error  among  loci.  However,  this  kind  of  test  of  the  hypothesis  cannot  be  very 
rigorous,  since  the  effective  size  in  the  past  can  never  be  known  precisely. 

Recently, Ayala (1972) showed that the average heterozygosity in Drosophila
willistoni is 0.177. He estimates the effective population size of this
population to be at least $10^9$. If the mutation rate is $10^{-7}$ per locus per year
and there are 10 generations in a year, then the expected heterozygosity at
steady  state  becomes  0.976.  This  value  is  much  higher  than  the  observed 
value.  Because  of  this  discrepancy,  Ayala  believed  that  his  observation  cannot 
be  explained  by  the  neutral  mutation  hypothesis.  Ohta  and  Kimura  (1973) 
and  Nei  et  al.  (1975),  however,  tried  to  explain  the  discrepancy  by  the  sup- 
position that  the  population  size  has  increased  only  recently  and  the  gene 
diversity  has  not  reached  the  steady  state  value.  It  should  be  noted  that  it 
takes about $10^7$ years for the steady state value to be attained approximately
once this is disturbed (see formula (5.110a)). Another possible factor for the
relatively  small  heterozygosity  is  the  random  fluctuation  of  selection  in- 
tensity, which  would  reduce  genetic  variability  considerably  (Fisher  and 
Ford,  1947;  Wright,  1948a).  At  any  rate,  it  is  noted  that  in  order  to  explain 
Ayala's  data  some  mechanism  which  reduces  genetic  variability  must  be 
assumed;  balancing  selection  is  not  required. 
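The size of the discrepancy in Ayala's data is easy to reproduce. The sketch below uses only the numbers quoted above; the final step, which asks what $N_e$ would yield the observed heterozygosity at steady state, is an added illustration and not a calculation made in the text.

```python
def steady_state_heterozygosity(n_e, v):
    """H = 4*N_e*v / (1 + 4*N_e*v) for neutral alleles."""
    m = 4 * n_e * v
    return m / (1 + m)


v_per_generation = 1e-7 / 10        # 10 generations per year
h_expected = steady_state_heterozygosity(1e9, v_per_generation)

# Added illustration: the N_e that would give the observed value at steady state
h_observed = 0.177
implied_m = h_observed / (1 - h_observed)
implied_n_e = implied_m / (4 * v_per_generation)
print(round(h_expected, 3), round(implied_n_e))
```

The expected value is 0.976, and the implied effective size is several million, more than two orders of magnitude below Ayala's lower bound.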

Ohta  and  Kimura  (1973)  noted  that  the  expected  gene  diversity  for  electro- 
phoretically  detectable  protein  loci  may  be  smaller  than  the  value  given  by 
$4Nv/(1 + 4Nv)$ even if $4Nv$ is the same for all loci. This is because a charge
change  of  a protein  that  was  induced  by  an  amino  acid  substitution  may  be 
cancelled  out  by  the  second  amino  acid  substitution  which  produces  an 
opposite  charge  change.  In  fact,  it  can  be  shown  that  the  expected  homo- 
zygosity under  this  circumstance  is  given  by 

$$J = \frac{1}{\sqrt{1 + 8N_e v}}, \tag{6.17}$$

where $v$ is the rate of mutations which induce electrophoretic charge changes.
In the above case of $N_e = 10^9$ and $v = 10^{-8}$, $H = 1 - J$ becomes 0.889.
Thus,  the  expected  value  is  still  much  higher  than  the  observed,  and  this 
factor  alone  cannot  explain  the  discrepancy. 

Incidentally, if $8N_e v$ is small compared with 1, the above formula for $J$
can be expressed as

$$J = \frac{1}{1 + 4N_e v - (8N_e v)^2/8 + \cdots} \approx \frac{1}{1 + 4N_e v}.$$

In many organisms $8N_e v$ seems to be about 0.3 or less, so that the average gene
diversity is approximately given by the previous formula $4N_e v/(1 + 4N_e v)$.
The  accuracy  of  this  formula  becomes  higher  if  the  tertiary  structure  of 
protein  affects  the  electrophoretic  mobility  or  if  heat  treatment  technique 
is  used  in  combination  with  electrophoresis. 
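The two homozygosity formulas can be compared directly. The sketch below takes formula (6.17) to be $J = 1/\sqrt{1 + 8N_e v}$, which reproduces the value $H = 0.889$ quoted above; the pair $N_e = 3750$, $v = 10^{-5}$ is a hypothetical choice that gives $8N_e v = 0.3$.

```python
import math


def homozygosity_charge_model(n_e, v):
    """J = 1 / sqrt(1 + 8*N_e*v), allowing charge changes to cancel."""
    return 1 / math.sqrt(1 + 8 * n_e * v)


def homozygosity_all_new(n_e, v):
    """J = 1 / (1 + 4*N_e*v), every mutation giving a new allele."""
    return 1 / (1 + 4 * n_e * v)


# Ayala's case: N_e = 1e9 and v = 1e-8 per generation
h_ayala = 1 - homozygosity_charge_model(1e9, 1e-8)

# Hypothetical case with 8*N_e*v = 0.3
h_charge = 1 - homozygosity_charge_model(3750, 1e-5)
h_simple = 1 - homozygosity_all_new(3750, 1e-5)
print(round(h_ayala, 3), round(h_charge, 3), round(h_simple, 3))
```

At $8N_e v = 0.3$ the two values of $H$ differ by less than 0.01, which is why the simpler formula is adequate for many organisms.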
Table 6.9

Expected (Var(h)) and observed (Vg(h)) variances of heterozygosity among loci in various
organisms. When there are more than two populations, the average values are given.

Organism                 No. of        Average       H       Var(h)    Vg(h)
                         populations   no. of loci
D. pseudoobscura (a)     3             24            0.122   0.03187   0.04698
D. willistoni (b)        2             25            0.192   0.04271   0.03925
Horseshoe crab (c)       4             25            0.061   0.01818   0.01569
Anolis carolinensis (d)  4             23            0.051   0.01522   0.01671
House mouse (e)          4             40            0.085   0.02396   0.02449
Thomomys talpoides (f)   10            31            0.056   0.01638   0.01736
Man (g)                  3             57            0.096   0.0266    0.0269

Source of data: (a) Prakash et al. (1969); (b) Ayala and Tracey (1973); (c) Selander et al.
(1970); (d) Webster et al. (1972); (e) Selander et al. (1969); (f) Nevo et al. (1974); (g) Nei and
Roychoudhury (1974b).


Nei  and  Roychoudhury  (1974a)  studied  whether  the  relationship  between 
the  mean  and  variance  of  heterozygosity  agrees  with  the  theoretical  expecta- 
tion under  neutral  mutations.  This  method  does  not  require  separate  esti- 
mates of  Ne  and  v.  Stewart  (1974)  and  Li  and  Nei  (1975)  have  shown  that 
in  a randomly  mating  population  the  steady  state  variance  of  population 
heterozygosity  at  individual  loci  under  the  hypothesis  of  neutral  mutations 
is  given  by 


$$\mathrm{Var}(h) = \frac{2M}{(1 + M)^2 (2 + M)(3 + M)},$$

while the mean is $H = M/(1 + M)$. Therefore, if we estimate $M$ from the
estimate  of  H,  we  can  compute  the  expected  variance  of  heterozygosity.  This 
expected  variance  can  be  compared  with  the  observed  variance  of  population 
heterozygosity among different loci. The variance ($V(h)$) of observed heterozygosities
at different loci, however, includes the sampling variance ($V_s(h)$)
at  the  time  of  gene  frequency  survey,  and  this  must  be  subtracted.  The 
detailed  procedure  is  given  in  the  paper  by  Nei  and  Roychoudhury. 

The expected ($\mathrm{Var}(h)$) and observed ($V_g(h)$) variances of heterozygosity
in various organisms are given in table 6.9. In this table only those organisms
in which a relatively large number of loci are studied are included. It is seen
that in many organisms the observed value agrees quite well with the theoretical,
though the former tends to be slightly larger than the latter. The
slightly larger values of $V_g(h)$ may be due to the varying mutation rates
among different loci. Thus, the neutral mutation theory fits the data. Nevertheless,
the agreement between $\mathrm{Var}(h)$ and $V_g(h)$ is not proof of the neutral
mutation  hypothesis.  Some  combinations  of  different  types  of  selection  may 
well  produce  the  same  effect. 
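The expected variances in table 6.9 can be approximately recovered from the average heterozygosities alone. The sketch below assumes the variance formula $\mathrm{Var}(h) = 2M/[(1 + M)^2(2 + M)(3 + M)]$ with $M$ estimated as $H/(1 - H)$. Because the table averages the variances over populations, while this calculation starts from the averaged $H$, small differences from the tabulated values are to be expected.

```python
def expected_variance(h_mean):
    """Var(h) under neutrality, with M estimated from the mean heterozygosity."""
    m = h_mean / (1 - h_mean)
    return 2 * m / ((1 + m) ** 2 * (2 + m) * (3 + m))


# (H, tabulated Var(h), tabulated Vg(h)) taken from table 6.9
table_6_9 = {
    "D. pseudoobscura": (0.122, 0.03187, 0.04698),
    "D. willistoni": (0.192, 0.04271, 0.03925),
    "Horseshoe crab": (0.061, 0.01818, 0.01569),
    "Anolis carolinensis": (0.051, 0.01522, 0.01671),
    "House mouse": (0.085, 0.02396, 0.02449),
    "Thomomys talpoides": (0.056, 0.01638, 0.01736),
    "Man": (0.096, 0.0266, 0.0269),
}

recomputed = {name: expected_variance(row[0]) for name, row in table_6_9.items()}
for name, (h, var_table, vg_observed) in table_6_9.items():
    print(f"{name:20s} {h:.3f} {recomputed[name]:.5f} {var_table:.5f} {vg_observed:.5f}")
```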

Maruyama  (1972a)  and  Yamazaki  and  Maruyama  (1972,  1974)  provided 
a method  to  distinguish  between  neutral  and  overdominant  genes  by  using 
the  relationship  between  gene  frequency  and  heterozygosity.  As  shown  in 
ch. 5, the steady state distribution of neutral genes with irreversible mutations
is given by $\Phi(x) = 4Nv/x$ for $1/(2N) \le x \le 1$. Therefore, the heterozygosity
due to the genes whose frequency is between $x$ and $x + dx$ is

$$h(x)\,dx = 2x(1 - x)\Phi(x)\,dx = 8Nv(1 - x)\,dx. \tag{6.19}$$

Namely,  if  we  compute  heterozygosity  for  each  allele  separately,  and  take 
the  sum  of  heterozygosities  for  the  alleles  whose  frequency  is  between  x 
and  x + dx,  then  it  is  given  by  the  above  formula.  Clearly,  the  heterozygosity 
$h(x)$ decreases as $x$ increases. (Maruyama used $h(x)/(2Nv) = 4(1 - x)$ rather
than  h(x)  itself.) 

On the other hand, if most mutant genes ($A_1$) are selectively advantageous
such that the fitnesses of $A_2A_2$, $A_1A_2$, and $A_1A_1$ are 1, $1 + s$, and $1 + 2s$,
then the gene frequency distribution is given by formula (5.102). If $4Ns$ is
much larger than 1, it reduces to $\Phi(x) = 4Nv/[x(1 - x)]$ approximately.
Therefore, we have

$$h(x)\,dx = 8Nv\,dx. \tag{6.20}$$

Clearly,  heterozygosity  is  constant  irrespective  of  gene  frequency.  If  mutant 
genes are mostly deleterious, $s$ in (5.102) should be replaced by $-s$, and we
have

$$h(x)\,dx = 8Nv\,e^{-4Nsx}\,dx \tag{6.21}$$

approximately. If a majority of mutant genes is overdominant, it is not
easy to obtain a simple formula, but $h(x)$ is expected to have a unimodal
distribution with a peak around $x = 0.5$ (curve (4) in fig. 6.2). (Ayala and
Gilpin (1973) presented alternative distributions for overdominant genes,
but their distributions are unrealistic since they ignored the effect of stochastic
elements.) Therefore, if we study the relationship between $h(x)\,dx$ and $x$, we
can make some inference about the mechanism of maintenance of genetic
polymorphism. Maruyama (1972a) showed that the above formulae hold
irrespective of the geographical structure of the population, if $h(x)$ is defined
as the average heterozygosity within random mating subunits of the population.

Fig. 6.2. Relationship between heterozygosity and gene frequency. The curves indicate
the theoretical expectations: (1) neutral, (2) advantageous, (3) deleterious, and (4) overdominance.
The dots indicate the observed values. From Yamazaki and Maruyama (1974),
reprinted by permission, The American Association for the Advancement of Science,
© 1974.
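The contrast between the neutral and advantageous cases can be made concrete with a few lines of code. The sketch below takes $h(x) = 8Nv(1 - x)$ for neutral genes and $h(x) = 8Nv$ for advantageous genes with $4Ns$ much larger than 1, and integrates each over the interval from 0 to 1; the values of $N$ and $v$ and the helper `integrate` are illustrative assumptions.

```python
def h_neutral(x, n, v):
    """Heterozygosity density for neutral mutants: 8*N*v*(1 - x)."""
    return 8 * n * v * (1 - x)


def h_advantageous(x, n, v):
    """Heterozygosity density for advantageous mutants: constant 8*N*v."""
    return 8 * n * v


def integrate(f, steps=1000):
    """Midpoint rule on the interval (0, 1)."""
    width = 1 / steps
    return sum(f((k + 0.5) * width) for k in range(steps)) * width


n, v = 5_000, 1e-6      # hypothetical values, giving 4*N*v = 0.02
total_neutral = integrate(lambda x: h_neutral(x, n, v))
total_advantageous = integrate(lambda x: h_advantageous(x, n, v))
print(total_neutral, total_advantageous)
```

The totals come out as $4Nv$ and $8Nv$ respectively, so for a given mutation rate advantageous mutants contribute twice as much heterozygosity as neutral ones while they are segregating.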

In  practice,  there  are  some  difficulties  in  applying  the  above  theory.  First, 
the  above  formulae  are  given  as  a function  of  mutant  gene  frequency.  In 
reality,  however,  there  is  no  way  to  tell  which  allele  is  mutant  and  which 
allele  is  the  original  gene.  Yamazaki  and  Maruyama  avoided  this  problem  by 
folding  the  gene  frequency  class  around  0.5,  so  that  the  heterozygosity 
corresponding  to  gene  frequency  1 - x is  added  to  that  corresponding  to  x. 
The new ordinate for neutral genes is therefore $8Nv(1 - x) + 8Nv[1 - (1 - x)] = 8Nv$
for $0 \le x \le 0.5$. Namely, this procedure makes $h(x)$
constant  irrespective  of  gene  frequency,  as  in  the  case  of  selectively  ad- 
vantageous mutations.  However,  it  is  still  possible  to  distinguish  the  case 
of  neutral  or  selectively  advantageous  genes  from  those  of  deleterious  and 
overdominant genes. When plotting the value of h(x) against x, Yamazaki
and  Maruyama  also  eliminated  one  allele  at  random  from  each  locus  to 
correct  the  bias  introduced  from  the  interdependence  of  allele  frequencies. 
Second, the formulae for $\Phi(x)$ used above are based on the assumption that
each mutation is unique and no further mutations occur in the population
until the mutant gene is fixed or lost. This assumption is satisfactory if
$4Nv$ is very small compared with 1. However, if the probability of mutation
of polymorphic genes is high, then $\Phi(x) = M(1 - x)^{M-1}x^{-1}$ rather than
the earlier $\Phi(x)$ should be used for neutral genes. Therefore, $h(x)$ is proportional to
$x^M + (1 - x)^M$ for $0 \le x \le 0.5$ (Ewens and Feldman, 1974). However, this
function is also roughly uniform when $4Nv \ll 1$, so that the Maruyama-Yamazaki
test seems to be still applicable. The forms of $\Phi(x)$ for other types
of  genes  are  not  known. 
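The effect of folding can be verified numerically. The sketch below evaluates the folded neutral ordinate, which should equal $8Nv$ at every frequency, and the function $x^M + (1 - x)^M$ for a small value of $M$; the grid of frequencies and the parameter values are hypothetical.

```python
def folded_neutral(x, n, v):
    """h(x) + h(1 - x) with h(x) = 8*N*v*(1 - x)."""
    return 8 * n * v * (1 - x) + 8 * n * v * (1 - (1 - x))


def folded_recurrent(x, m):
    """Relative folded ordinate x**M + (1 - x)**M when mutation recurs."""
    return x ** m + (1 - x) ** m


xs = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]

neutral = [folded_neutral(x, 1000, 1e-5) for x in xs]     # 8*N*v = 0.08
recurrent = [folded_recurrent(x, 0.1) for x in xs]        # M = 4*N*v = 0.1

# Relative spread of the recurrent-mutation curve over the grid
spread = (max(recurrent) - min(recurrent)) / max(recurrent)
print(neutral)
print(recurrent, spread)
```

With $M = 0.1$ the second curve varies by well under 10 percent over this range, which is the sense in which it is roughly uniform.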

At  any  rate,  Yamazaki  and  Maruyama  applied  the  above  theory  to  gene 
frequency  data  for  1045  independent  alleles  at  protein  loci  from  various 
organisms. Note that in this test only the relative value of h(x) is important,
so  that  data  from  different  loci  in  different  organisms  can  be  pooled  together. 
The  results  obtained  are  given  in  fig.  6.2.  It  is  clear  that  the  relationship 
between  h(x)  and  x is  consistent  with  the  hypothesis  of  neutral  mutations  or 
selectively  advantageous  mutations.  Between  these  two  alternatives,  the 
neutral  mutation  hypothesis  is  more  appealing  because  it  is  unlikely  that 
most  new  mutants  are  more  fit  than  the  alleles  from  which  they  mutated 
(see  also  subsec.  6.5.4).  For  these  reasons,  Yamazaki  and  Maruyama 
regarded  their  result  as  evidence  favoring  the  neutral  mutation  hypothesis. 
Of  course,  their  conclusion  is  not  decisive,  since  the  rectangular  distribution 
of  h(x)  can  also  be  explained  by  an  appropriate  mixture  of  deleterious  and 
overdominant  loci.  Yamazaki  and  Maruyama  also  studied  the  distribution 
of h(x) for human blood group genes and obtained a pattern similar to that
for  overdominance.  The  26  loci  they  used,  however,  clearly  deviated  from  a 
random  sample  of  the  genome  (cf.  sec.  6.3),  so  that  their  conclusion  is  not 
justified. 

There  are  several  other  methods  designed  to  test  the  neutral  mutation 
hypothesis.  Ewens  (1972)  proposed  a crude  method  of  testing  by  using  the 
sampling  theory  of  neutral  alleles.  This  method  is,  however,  very  sensitive 
to  deleterious  alleles.  If  any  of  these  alleles  are  included  in  the  sample,  the 
test  would  generally  indicate  nonneutrality  of  genes,  even  if  they  constitute  a 
minor  component  of  genetic  variability.  The  same  thing  can  be  said  about 
the  method  which  makes  use  of  the  relationship  between  the  actual  and 
effective  numbers  of  alleles  per  locus  (Johnson  and  Feldman,  1973;  Yamazaki 
and Maruyama, 1973), though this method is less sensitive than Ewens'.
Recently,  Lewontin  and  Krakauer  (1973)  claimed  that  the  neutral  mutation 
theory  can  be  tested  by  examining  the  variation  of  Wright's  (1943,  1951) 
Fst  among  different  loci.  As  pointed  out  by  Nei  and  Maruyama  (1975), 
however,  their  method  does  not  appear  to  be  theoretically  justifiable. 

In general, it seems to be very difficult to draw a definite conclusion about
the  mechanism  of  maintenance  from  a study  of  gene  frequency  data  alone. 
At  the  present  time,  most  of  the  gene  frequency  data  available  can  be  ex- 
plained either  by  the  neutral  mutation  hypothesis  or  by  the  selection  hypo- 
thesis. There  are,  of  course,  some  data  on  specific  loci  which  are  hard  to  explain 
by  the  former  hypothesis,  but,  as  emphasized  earlier,  we  are  concerned  with 
the  majority  of  loci  rather  than  a few  exceptions.  To  arrive  at  a definite 
conclusion,  perhaps  we  must  observe  the  frequency  changes  of  many  genes 
in  natural  populations  for  a long  period  of  time.  Unfortunately,  the  genetic 
change  of  populations  is  a very  slow  process  compared  with  our  lifetime 
except  in  some  lower  organisms.  Another  approach  to  this  question  is  to 
study  the  amino  acid  sequences  of  typical  polymorphic  proteins.  If  this  is 
done  in  many  related  organisms,  we  will  know  the  proportion  of  the  alleles 
that  have  been  kept  in  the  population  for  a long  period  of  time  by  some  sort 
of  balancing  selection.  As  will  be  seen  in  the  next  chapter,  however,  data  on 
such  proteins  as  hemoglobin,  cytochrome  c,  fibrinopeptide,  etc.,  suggest 
that  gene  substitution  occurs  almost  continuously  and  thus  balancing 
selection  is  rare.  Data  on  gene  identity  between  closely  related  species  also 
support  this  conclusion. 

Still  another  approach  to  our  problem  is  to  study  the  biochemical  and 
physiological  properties  of  polymorphic  genes.  Some  studies  in  this  direction 
have  already  been  made.  As  mentioned  earlier,  the  heterozygotes  for  hemo- 
globin S in  man  have  a higher  fitness  than  the  hemoglobin  A homozygotes 
in  malarial  areas  because  of  a higher  resistance  to  malaria.  It  is  known  that 
hemoglobin  S produced  in  heterozygotes  forms  large  crystal  aggregates  under 
conditions  of  low  oxygen  tension.  This  appears  to  reduce  the  vigor  of  the 
malarial  parasite  Plasmodium  falciparum  in  the  A/S  sickler  environment, 
probably  because  the  parasite  which  apparently  derives  most  of  its  nutrition 
from  the  hemoglobin  in  the  red  blood  cells  cannot  digest  the  hemoglobin 
in  the  form  of  crystalline  aggregates.  Another  possible  explanation  for 
malarial resistance is that the sickle cells formed in heterozygotes are
phagocytized, which brings about the preferential removal of the parasite
(Motulsky,  1964).  This  example  is,  however,  very  special,  and  in  other  cases 
the  biochemical  and  physiological  mechanisms  are  largely  unknown. 

At the red cell acid phosphatase locus in man, there are three major
alleles.  Spencer  et  al.  (1964)  have  shown  that  the  level  of  acid  phosphatase 
activity  in  red  cells  of  one  homozygote  (BB)  is  about  50%  greater  than  in 
another  homozygote  (AA)  and  the  heterozygote  (AB)  shows  an  intermediate 
level.  Harris  (1971)  reports  that  significant  biochemical  differences  between 
alleles  have  been  observed  at  16  out  of  the  23  enzyme  loci  so  far  studied. 
Similar  differences  in  enzyme  activity  have  been  reported  at  the  alcohol 
dehydrogenase locus in Drosophila melanogaster (Gibson, 1970; Vigue and
Johnson,  1973;  Day  et  al.,  1974).  It  is  probable  that  these  differences  in 
enzyme  activity  are  reflected  in  some  physiological  or  morphological 
characters.  Yet,  it  is  not  proof  of  the  nonneutrality  of  genes  in  population 
dynamics.  As  will  be  discussed  later  (ch.  8),  at  least  some  proportion  of  the 
genetic  variation  in  morphological  characters  seems  to  be  almost  neutral. 
In  fact,  there  are  no  obvious  differences  in  health  and  viability  between 
different  genotypes  for  red  cell  acid  phosphatase  in  man.  Clearly,  a more 
careful  study  on  the  whole  process  of  gene  function  should  be  made. 

6.5.4  Transient  polymorphism  due  to  selection 

In the Maruyama-Yamazaki test of neutral mutations selectively advantageous
genes cannot be distinguished from neutral genes. Maruyama (1972b),
however, argues that the contribution of advantageous genes to heterozygosity
is likely to be small compared with that due to neutral mutations. We have
seen that $h(x) = 8Nv(1 - x)$ for neutral genes and $h(x) = 8Nv$ for advantageous
genes (genic selection). Therefore, for a fixed mutation rate, $v$, the
total contribution is $\int_0^1 h(x)\,dx = 4Nv$ for the former and $8Nv$ for the latter.
Now let $P$ and $1 - P$ be the relative amounts of heterozygosity due to neutral
and advantageous genes, respectively. Then, the relative mutation rates of
neutral and advantageous genes are $P$ and $(1 - P)/2$. We know that the rate
of gene substitution is $v$ for neutral genes and $4Nsv$ for advantageous genes
for a given mutation rate (ch. 5). Since the relative mutation rates for the
two classes of genes are $P$ and $(1 - P)/2$, the ratio of neutral gene substitutions
($\alpha_n$) to selective gene substitutions ($\alpha_s$) is $\alpha_n/\alpha_s = P/[2Ns(1 - P)]$. Thus,

$$P = \frac{1}{1 + \alpha_s/(2Ns\,\alpha_n)}.$$

This  indicates  that  even  if  the  proportion  of  neutral  gene  substitutions  is 
small,  say  5 percent,  a majority  of  polymorphisms  is  still  due  to  neutral 
mutations  if  Ns  > 10. 
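The threshold quoted here follows from solving $\alpha_n/\alpha_s = P/[2Ns(1 - P)]$ for $P$, which gives $P = 1/[1 + \alpha_s/(2Ns\,\alpha_n)]$. A minimal sketch, with a hypothetical function name, is:

```python
def neutral_share_of_heterozygosity(neutral_substitution_fraction, ns):
    """P = 1 / (1 + (alpha_s / alpha_n) / (2*N*s)).

    neutral_substitution_fraction is alpha_n / (alpha_n + alpha_s).
    """
    f = neutral_substitution_fraction
    ratio_selective_to_neutral = (1 - f) / f      # alpha_s / alpha_n
    return 1 / (1 + ratio_selective_to_neutral / (2 * ns))


# Only 5 percent of substitutions neutral, for several values of N*s
for ns in (1, 9, 10, 100):
    print(ns, round(neutral_share_of_heterozygosity(0.05, ns), 3))
```

With 5 percent neutral substitutions the ratio $\alpha_s/\alpha_n$ is 19, so $P$ exceeds one half as soon as $2Ns$ exceeds 19, that is, for $Ns$ of about 10 or more.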

The unimportance of transient polymorphism due to advantageous genes
can also be seen in the following way. In ch. 5 we have seen that for advantageous
genes the average number of heterozygous codons per locus at
steady state is $H(1/2N) = 8Nv$ (5.98) and the rate of gene substitution per
generation is $\alpha = 4Nsv$. On the other hand, we have estimated that the rate
of gene substitution per locus per year ($\alpha_y$) is $10^{-7}$ for electrophoretically
detectable proteins (ch. 3). Therefore, if the majority of gene substitutions
occur by selection, the average number of heterozygous codons per locus is
expected to be $H(1/2N) = 2t_g\alpha_y/s = 2(t_g/s) \times 10^{-7}$, where $t_g$ is the generation
time in years. In man $t_g$ was probably about 20 in the past. Thus, if $s = 0.1$,
then $H(1/2N) = 4 \times 10^{-5}$, which is much smaller than the observed value
(0.10 ~ 0.13; table 6.1). In many Drosophila species $t_g$ is probably 0.1,
so that $H(1/2N)$ becomes $2 \times 10^{-7}$. This is again very small compared with
the  observed  value  (0.17  ~ 0.27  from  table  6.2).  Clearly,  the  hypothesis  of 
selective  transient  polymorphism  cannot  explain  all  the  variation  in  natural 
populations. 
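The two numerical examples follow from eliminating $v$ between $H(1/2N) = 8Nv$ and $\alpha = 4Nsv$, which gives $H(1/2N) = 2\alpha/s$, with $\alpha$ equal to the generation time in years multiplied by the substitution rate per year. A minimal sketch, in which the function name is hypothetical and $s = 0.1$ is the value assumed in the text:

```python
def selective_heterozygous_codons(generation_years, s, alpha_per_year=1e-7):
    """H(1/2N) = 2 * t_g * alpha_y / s for advantageous mutants."""
    alpha_per_generation = generation_years * alpha_per_year
    return 2 * alpha_per_generation / s


h_man = selective_heterozygous_codons(generation_years=20, s=0.1)
h_drosophila = selective_heterozygous_codons(generation_years=0.1, s=0.1)
print(h_man, h_drosophila)
```

Both values are several orders of magnitude below the observed heterozygosities, and a smaller $s$ raises them only in inverse proportion.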

CHAPTER 7


Differentiation  of  populations 
and  speciation 


If  two  populations  are  isolated  from  each  other  for  geographic  or  re- 
productive reasons,  the  two  populations  tend  to  accumulate  different  genes. 
This  differentiation  of  genes  may  occur  through  three  different  factors,  i.e., 
mutation,  selection,  and  random  genetic  drift.  If  the  effective  sizes  of  two 
populations  are  given,  it  is  not  difficult  to  formulate  the  effects  of  mutation 
and  genetic  drift  on  the  average  gene  differences  per  locus  between  the  two 
populations  (ch.  5).  The  effect  of  selection  varies  considerably  with  the  genes 
concerned  and  the  environments  in  which  the  two  populations  are  located, 
so  that  a general  formulation  is  not  easy.  However,  if  we  use  a proper 
measure  of  gene  differences  and  make  certain  assumptions,  a simple  formula 
may  still  be  obtained. 

In  this  chapter  we  shall  first  discuss  a statistical  method  by  which  the  gene 
differences  between  two  populations  can  be  measured  and  then  examine 
actual  data  available  in  relation  to  speciation.  We  shall  also  discuss  the 
mechanisms  of  speciation  briefly. 


7.1 Measures of genetic distance

Genetic  distance  is  the  genetic  difference  between  populations  as  expressed 
by  a function  of  gene  frequencies.  In  recent  years  several  authors  (e.g., 
Sanghvi,  1953;  Cavalli-Sforza  and  Edwards,  1967;  Balakrishnan  and 
Sanghvi,  1968;  Hedrick,  1971;  Rogers,  1972)  proposed  different  measures  of 
genetic  distance.  In  many  of  them,  however,  it  is  not  clear  what  biological 
unit  they  are  going  to  measure.  (I  (Nei,  1973a)  have  discussed  the  advantages 
and  disadvantages  of  these  measures  extensively.)  From  the  standpoint  of 
genetics,  the  most  appropriate  measure  of  genetic  distance  would  be  the 
number of nucleotide or codon differences per unit length of DNA. Theoretically,
it is possible to determine the number of nucleotide differences by
biochemical  techniques.  At  the  present  time,  however,  sequencing  of  nucleo- 
tides is  very  expensive  and  time  consuming  even  for  a short  length  of  DNA. 
To determine the average number of nucleotide differences per unit length of
DNA,  a reasonably  large  portion  of  the  total  DNA  must  be  examined.  DNA 
hybridization  techniques  now  available  are  too  crude  to  be  used  for  detecting 
a small  number  of  nucleotide  differences  that  would  occur  among  local 
populations  within  a species. 

In  view  of  this  circumstance  I (Nei  1971a,  1972,  1973a)  developed  a 
statistical  method  by  which  the  average  number  of  codon  differences  per 
locus  can  be  estimated  from  gene  frequency  data.  Theoretically,  this  method 
can  be  applied  to  any  pair  of  taxa,  whether  they  are  local  populations, 
species,  or  genera,  if  enough  data  are  available.  Of  course,  the  current 
techniques  of  studying  gene  frequencies,  such  as  electrophoresis  and  immuno- 
logical reaction,  cannot  detect  all  codon  differences,  so  that  we  are  forced 
to  deal  with  only  those  codon  differences  that  are  detectable  by  the  current 
techniques,  though  some  correction  for  undetectable  codons  can  be  made 
under certain circumstances. In addition to this, there are some other
statistical  problems  which  make  it  difficult  to  estimate  the  exact  number  of 
codon differences. For these reasons, I have proposed three different
measures  of  genetic  distance,  i.e.,  the  minimum,  standard,  and  maximum 
estimates  of  codon  differences  per  locus.  All  these  estimates  refer  to  the 
codon  differences  that  are  detectable  by  the  techniques  used. 

Consider two populations, X and Y, in which multiple alleles are segregating
at a locus. Let $x_i$ and $y_i$ be the frequencies of the $i$-th alleles in X and
Y, respectively. The probability of identity of two randomly chosen genes is
$j_X = \sum x_i^2$ in population X, while it is $j_Y = \sum y_i^2$ in population Y. The probability
of identity of two genes, chosen at random, one from each of the
two populations, is $j_{XY} = \sum x_i y_i$. Note that the identity of genes defined in
this way is the observed one and requires no assumptions about selection,
mutation, and migration. We designate by $J_X$, $J_Y$, and $J_{XY}$ the arithmetic
means of $j_X$, $j_Y$, and $j_{XY}$ over all loci, including monomorphic ones, respectively.
Clearly, $D_{X(m)} = 1 - J_X$, $D_{Y(m)} = 1 - J_Y$, and $D_{XY(m)} = 1 - J_{XY}$
are equal to the proportion of different genes between two randomly chosen
genomes from the respective populations.

As discussed in ch. 6, $D_{X(m)}$ and $D_{Y(m)}$ are minimum estimates of codon
differences between two randomly chosen genomes from populations X and
Y, respectively. On the other hand, $D_{XY(m)}$ is a minimum estimate of codon
differences per locus between two randomly chosen genomes, one from each
of X and Y. Therefore,

$D_m = D_{XY(m)} - (D_{X(m)} + D_{Y(m)})/2$    (7.1)

may  be  regarded  as  a minimum  estimate  of  net  codon  differences  per  locus 
between  X and  Y when  intrapopulational  codon  differences  are  subtracted. 
We  call  Dm  the  minimum  genetic  distance.  It  is  noted  that  this  distance  is 
identical  to  the  interpopulational  gene  diversity  Dm  in  (6.12)  when  there  are 
only  two  populations. 
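
The definition in (7.1) can be illustrated with a short computation. The following is a minimal sketch, not part of the original text: the function names and the two-locus allele frequencies are hypothetical, and each locus is assumed to be given as a list of allele frequencies in the same allele order for both populations.

```python
from typing import Sequence


def identity(p: Sequence[float], q: Sequence[float]) -> float:
    """Probability that two genes, one drawn from each frequency vector, are identical."""
    return sum(a * b for a, b in zip(p, q))


def minimum_distance(x_loci, y_loci) -> float:
    """Minimum genetic distance D_m of (7.1), with identities averaged over all loci."""
    n = len(x_loci)
    J_x = sum(identity(x, x) for x in x_loci) / n
    J_y = sum(identity(y, y) for y in y_loci) / n
    J_xy = sum(identity(x, y) for x, y in zip(x_loci, y_loci)) / n
    # D_m = D_XY(m) - (D_X(m) + D_Y(m)) / 2, where each D_(m) is 1 minus an identity
    return (1 - J_xy) - ((1 - J_x) + (1 - J_y)) / 2


# Hypothetical data: locus 1 is polymorphic, locus 2 is monomorphic for the same allele.
X = [[0.7, 0.3], [1.0, 0.0]]
Y = [[0.4, 0.6], [1.0, 0.0]]
print(round(minimum_distance(X, Y), 4))
```

Including the monomorphic locus halves the distance relative to the polymorphic locus alone, which is why the averages are taken over all loci.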

The drawback of $D_m$ is that $D_{X(m)}$, $D_{Y(m)}$, and $D_{XY(m)}$ are the proportions
of different genes between two randomly chosen genomes, so that their
variation is not additive. Thus, $D_m$ may be a gross underestimate of the
number of net codon differences when $D_{XY(m)}$ is large. If individual codon
changes are independent, the mean number of net codon differences may
be given by

$D = -\log_e I$,    (7.2)

where

$I = J_{XY}/\sqrt{J_X J_Y}$    (7.3)

is  the  normalized  identity  of  genes  between  X and  Y.  We  call  D the  standard 
genetic  distance.  It  is  noted  that  D can  be  written  as 

$D = D_{XY} - (D_X + D_Y)/2$,    (7.4)

where $D_{XY} = -\log_e J_{XY}$, $D_X = -\log_e J_X$, and $D_Y = -\log_e J_Y$. If we note that
$D_X$, $D_Y$, and $D_{XY}$ are estimates of codon differences per locus (6.2), it is clear
that D is a quantity equivalent to (7.1). Theoretically, the normalized identity
of genes between X and Y can also be defined as $I = 2J_{XY}/(J_X + J_Y)$ instead
of  (7.3),  but  this  definition  does  not  permit  the  nice  biological  interpretation 
mentioned  above. 
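
The equivalence of (7.2) and (7.4) is easy to check numerically. The sketch below is illustrative only; the helper names and the allele frequencies are hypothetical, and loci are again assumed to be lists of allele frequencies in a common allele order.

```python
import math


def mean_identities(x_loci, y_loci):
    """Arithmetic means J_X, J_Y, J_XY of the single-locus identities over all loci."""
    n = len(x_loci)
    J_x = sum(sum(a * a for a in x) for x in x_loci) / n
    J_y = sum(sum(b * b for b in y) for y in y_loci) / n
    J_xy = sum(sum(a * b for a, b in zip(x, y)) for x, y in zip(x_loci, y_loci)) / n
    return J_x, J_y, J_xy


def standard_distance(x_loci, y_loci) -> float:
    """Standard genetic distance D = -log_e I, with I the normalized identity (7.3)."""
    J_x, J_y, J_xy = mean_identities(x_loci, y_loci)
    return -math.log(J_xy / math.sqrt(J_x * J_y))


def standard_distance_additive(x_loci, y_loci) -> float:
    """The same quantity written in the form (7.4): D_XY - (D_X + D_Y) / 2."""
    J_x, J_y, J_xy = mean_identities(x_loci, y_loci)
    return -math.log(J_xy) - (-math.log(J_x) - math.log(J_y)) / 2


# Hypothetical allele frequencies at two loci.
X = [[0.7, 0.3], [1.0, 0.0]]
Y = [[0.4, 0.6], [1.0, 0.0]]
print(standard_distance(X, Y), standard_distance_additive(X, Y))
```

For these hypothetical frequencies D comes out slightly larger than the proportion-based quantity in (7.1), as expected from the logarithmic transformation.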

As will be shown later, if the rate of gene (codon) substitution per year is
constant, D is linearly related to the time after divergence of two populations.
Also, under certain migration models D is linearly related to geographical distance
or area (Nei, 1972). Recently, Latter (1972) proposed a measure of genetic
divergence, $\gamma$. This quantity is nearly equal to 1 - I unless $J_X$ and $J_Y$ are
quite different. Therefore, when $\gamma$ is small compared with unity, it is approximately
equal to $D_m$ or D.

If  the  rate  of  codon  changes  varies  from  locus  to  locus,  D still  may  be  an 
underestimate  of  codon  differences.  In  this  case  the  mean  number  of  net 
codon differences may be estimated by

$D' = -\log_e I'$,    (7.5)

where $I' = J'_{XY}/\sqrt{J'_X J'_Y}$, in which $J'_{XY}$, $J'_X$, and $J'_Y$ are the geometric means
of $j_{XY}$, $j_X$, and $j_Y$, respectively, over different loci. It is clear that D' permits
an interpretation similar to (7.1) and (7.4) when codon differences are
estimated by (6.3). In practice, however, D' is affected to a considerable
extent by sampling errors of gene frequencies at the time of population
survey as well as by random genetic drift. These factors are expected generally
to inflate the estimate of the mean number of net codon differences. Therefore,
I call D' the maximum genetic distance. If any of the values of
$j_{XY}/\sqrt{j_X j_Y}$ for individual loci is small, D' can be a gross overestimate. In fact,
if  there  is  a single  locus  at  which  there  is  no  common  allele  between  two 
populations,  D'  is  infinitely  large.  Therefore,  I propose  that  for  general 
purposes  D rather  than  D'  be  used.  D can  be  used  for  studying  genetic 
distance  both  between  and  within  species. 
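
The sensitivity of D' to a single locus can be seen in a small sketch. This is an illustration under stated assumptions, not the author's procedure: the function name and data are hypothetical, and the geometric means in (7.5) are computed through the mean of single-locus logarithms, which is algebraically the same thing.

```python
import math


def maximum_distance(x_loci, y_loci) -> float:
    """Maximum genetic distance D' = -log_e I', using geometric means over loci."""
    log_sum = 0.0
    for x, y in zip(x_loci, y_loci):
        j_x = sum(a * a for a in x)
        j_y = sum(b * b for b in y)
        j_xy = sum(a * b for a, b in zip(x, y))
        if j_xy == 0.0:
            # No allele in common at this locus: the geometric mean of j_XY is zero.
            return math.inf
        log_sum += math.log(j_xy / math.sqrt(j_x * j_y))
    return -log_sum / len(x_loci)


# Hypothetical allele frequencies at two loci.
X = [[0.7, 0.3], [1.0, 0.0]]
Y = [[0.4, 0.6], [1.0, 0.0]]
print(maximum_distance(X, Y))

# Adding one locus with no shared allele makes D' infinite.
print(maximum_distance(X + [[1.0, 0.0]], Y + [[0.0, 1.0]]))
```

The second call shows the behavior described above: one locus without a common allele is enough to make D' infinitely large, whatever the other loci show.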

Nevertheless,  there  is  not  much  difference  between  Dm,  D,  and  D'  when 
local  populations  within  a species  are  compared.  In  this  case,  therefore,  any 
of  them  can  be  used.  In  most  practical  cases  Dm  < D < D'  but  this  relation 
does  not  necessarily  hold  when  the  values  of  these  quantities  are  extremely 
small.  In  such  a case,  however,  these  values  are  so  small,  that  they  are  almost 
always  within  their  standard  errors.  The  standard  errors  of  these  genetic 
distances  can  be  obtained  by  the  method  given  by  Nei  and  Roychoudhury 
(1974a).  The  variances  of  Dm  and  D due  to  random  genetic  drift  have  been 
studied  by  Li  and  Nei  (1975). 

So  far  we  have  defined  our  genetic  distance  measures  as  estimates  of 
codon  differences  per  locus,  so  that  a large  number  of  loci  are  to  be  examined. 
However,  collection  of  gene  frequency  data  is  time-consuming,  and  under 
certain  circumstances  only  a few  loci  may  be  available  for  the  study  of  gene 
differences. In this case the estimate of genetic distance may deviate considerably
from the real value. When local populations within the same species
are  compared,  this  deviation  is  expected  to  be  generally  upward,  since  gene 
frequencies  are  studied  more  often  with  highly  polymorphic  loci  than  with 
less  polymorphic  loci,  and  monomorphic  loci  in  these  populations  almost 
always  have  the  same  allele.  However,  if  one  is  interested  only  in  relative 
values  of  genetic  distance  among  several  populations,  the  estimate  of  distance 
based  on  a few  polymorphic  loci  would  still  be  useful.  As  relative  distances, 
$D_m$, $D$, and $D'$ can be used for any case because they depend on no assumptions
about the evolutionary forces.


7.2 Gene differentiation among populations: a general theory

7.2.1 Complete isolation

We have shown that the normalized identity of genes between two isolated
populations is given by $I = \exp(-2vt)$ under mutation pressure (5.114).
Let us now show that if we make certain assumptions essentially the same
formula holds even when there is selection. The assumptions we make are
as follows: 1) Populations X and Y are in equilibrium with respect to the
effects of mutation, selection, and random genetic drift, so that the average
gene identities ($J_X$ and $J_Y$) within populations remain constant. This assumption
seems to be satisfactory in most natural populations, since closely
related populations or species generally show the same degree of heterozygosity.
2) The rate of gene substitution per locus per year ($\alpha$) remains
constant. This assumption also seems to be roughly correct (ch. 8). In ch. 5
we have seen that $\alpha$ is equal to the mutation rate per year (v) if all mutations
are neutral (5.43), while it is equal to 4Nsv if mutant genes are advantageous
and semidominant (5.45).

Under these assumptions, the expectation of $j_{XY}$ in the t-th year after
reproductive isolation is given by

$j_{XY}^{(t)} = j_{XY}^{(0)} (1 - \alpha_X)^t (1 - \alpha_Y)^t \approx j_{XY}^{(0)} e^{-2\alpha t}$,    (7.6)

where $\alpha_X$ and $\alpha_Y$ are the values of $\alpha$ for populations X and Y, respectively.
In the following we denote the average of $\alpha_X$ and $\alpha_Y$ by $\alpha$. Since $J_X^{(t)} = J_X^{(0)}$
and $J_Y^{(t)} = J_Y^{(0)}$, the normalized identity of genes is

$I = J_{XY}^{(t)}/\sqrt{J_X^{(t)} J_Y^{(t)}} = I_0 e^{-2\alpha t}$    (7.7)

approximately, where $I_0 = J_{XY}^{(0)}/\sqrt{J_X^{(0)} J_Y^{(0)}}$ is expected to be close to
one in most cases, since no appreciable gene differentiation occurs as long
as there is migration between the two populations (7.14). Therefore, we have

$D \approx 2\alpha t$.    (7.8)

It is clear that D measures the accumulated number of gene (codon) substitutions
per locus between the two populations.

When $\alpha$ varies with locus, D' may be a better estimate of the number of
gene substitutions than D. Since the natural logarithm of $j_{XY}/\sqrt{j_X j_Y}$ at
the j-th group of loci is $-2\alpha_j t$, where $\alpha_j$ is the value of $\alpha$ at this group of loci
and $I_0$ is assumed to be one, D' can be written as

$D' = 2(\alpha_1 + \alpha_2 + \cdots + \alpha_r)t/r = 2\bar{\alpha}t$,    (7.9)

where $\bar{\alpha}$ is the average value of $\alpha_j$ and r is the number of different groups of
loci. In practice, however, this estimate is subject to a large sampling error, as
mentioned earlier.

There is another way to correct for the effect of varying $\alpha$. If we know the
variance of $\alpha$ or of $2\alpha t$, then the genetic distance can be computed by

$\bar{D} = -\log_e[I/(1 + \sigma^2_{2\alpha t}/2)]$    (7.10)

approximately, where $\bar{D} = 2\bar{\alpha}t$ and $\sigma^2_{2\alpha t}$ are the mean and variance of $2\alpha t$
(Nei, 1971a). In general, however, we do not know the value of $\sigma^2_{2\alpha t}$. Fortunately,
numerical computations have shown that (7.8) is quite robust and
applicable even if $\alpha$ varies considerably among loci (Nei and Chakraborty,
unpublished).
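
One way to see where the correction in (7.10) comes from is a second-order expansion of the identity averaged over loci. The sketch below is a reconstruction under stated assumptions and not necessarily the derivation given in Nei (1971a): it takes the initial identity to be one, writes $2\alpha t = \bar{D} + \varepsilon$ with $\varepsilon$ having mean zero and variance $\sigma^2_{2\alpha t}$ across loci, and drops terms beyond the second order.

```latex
\begin{align*}
I &= E\left[e^{-2\alpha t}\right] = e^{-\bar{D}}\, E\left[e^{-\varepsilon}\right] \\
  &\approx e^{-\bar{D}} \left(1 - E[\varepsilon] + \tfrac{1}{2} E[\varepsilon^{2}]\right)
   = e^{-\bar{D}} \left(1 + \sigma^{2}_{2\alpha t}/2\right), \\
\bar{D} &\approx -\log_e\left[\frac{I}{1 + \sigma^{2}_{2\alpha t}/2}\right].
\end{align*}
```

Under these assumptions the uncorrected value $-\log_e I$ falls short of $\bar{D}$ by about $\log_e(1 + \sigma^2_{2\alpha t}/2)$, which is small when the variance is small.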

In ch. 2 we applied the Poisson process to describe the evolutionary
change of proteins, neglecting the process of fixation of genes in populations.
We have shown that the probability of no amino acid substitutions occurring
at a particular site for a period of t years is given by $P_0(t) = e^{-\lambda t}$. Therefore,
the probability that two homologous polypeptides of n amino acids in
related taxa have undergone no amino acid substitution during t years is

$P_0(t) = e^{-2n\lambda t}$.    (7.11)

This formula is identical to (7.7), since $\alpha = n\lambda$ if all amino acid differences
are detectable by the technique used.

The  differentiation  of  genes  between  populations  is  generally  a slow 
process.  Two  closely  related  species  often  have  many  common  genes.  For 
example, the amino acid sequences of hemoglobin α- and β-chains in chimpanzee
are identical with those in man. Therefore, in order to have a reliable
estimate  of  D a large  number  of  genes  must  be  examined.  A most  reliable 
method  of  detecting  gene  differences  between  closely  related  taxa  is  to 
sequence  amino  acids  of  the  proteins  produced.  At  present,  however,  this 
method  cannot  be  used  for  many  proteins,  as  mentioned  earlier.  A more 
rapid and efficient method is to use electrophoresis (Hubby and Throckmorton,
1965). In fact, most studies on gene differences between closely
related taxa have been done by using this technique.

As noted earlier, electrophoresis detects only a portion of amino acid
differences (1/4 ~ 1/3). If c is the proportion of amino acid differences that
are detectable by electrophoresis, then the electrophoretic identity of
proteins between two taxa may be written as

$I = e^{-2cn\lambda t}$    (7.12)

approximately. Namely, $\alpha = cn\lambda$ in this case. Therefore, the number of
electrophoretically detectable codon differences per locus can be estimated
by $D = -\log_e I$. The actual number of codon differences ($2n\lambda t$) is then
estimated by $D/c$.

Strictly speaking, (7.12) does not hold when $2cn\lambda t$ is large, say more than
1, since the detectability of protein differences by electrophoresis is expected
to decline gradually as the time after divergence increases. This is because a
difference in the net charge of a protein between two taxa, which is induced
by a certain amino acid substitution in one of the two species, may be
cancelled out by a second amino acid substitution occurring in the same
species or the other. Nei and Chakraborty (1973) (see also J. L. King, 1973)
studied this problem and showed that (7.12) is applicable if $2n\lambda t < 2$ but
it can be a serious underestimate if $2n\lambda t$ is large. Therefore, when D is large,
say more than 1, $(-\log_e I)/c$ should be regarded as an underestimate of
$2n\lambda t$. If the heat denaturation technique mentioned in ch. 3 is used in addition
to electrophoresis, c can be as large as 0.5 ~ 0.7. In this case the
relationship $D = 2cn\lambda t$ holds for a larger value of D (Maruyama, unpublished).
Note also that the variation of $\alpha$ among loci also results in an
underestimate.

At any rate, if we know $\alpha = cn\lambda$, an approximate time after divergence
between two taxa may be estimated by

$t = D/(2\alpha)$.    (7.13)

Our current estimate of $\alpha$ is very crude, so that the above method gives only
a rough estimate of divergence time. However, in organisms where no fossil
records are available, even such an estimate seems to be very valuable.
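
The two conversions just described, from detectable to actual codon differences and from distance to time, amount to two divisions. The following sketch is illustrative only: the function names are hypothetical, and the numerical values of D, c, and the substitution rate are made up for the example rather than taken from any data set.

```python
def actual_codon_differences(d_detectable: float, c: float) -> float:
    """Estimate of all codon differences per locus, D / c, from the detectable part."""
    return d_detectable / c


def divergence_time(d: float, alpha: float) -> float:
    """Approximate time since divergence from (7.13): t = D / (2 * alpha)."""
    return d / (2.0 * alpha)


# Hypothetical inputs: D = 0.05 detectable differences per locus, one quarter of
# differences detectable, and an assumed rate of 1e-7 substitutions per locus per year.
D = 0.05
print(actual_codon_differences(D, c=0.25))
print(divergence_time(D, alpha=1e-7))
```

Because the time estimate is inversely proportional to the assumed rate, any error in that rate carries over to t in the same proportion, which is why the text calls the estimate rough.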

In  the  study  of  evolution  it  is  often  required  to  make  a phylogenetic  tree 
among  a number  of  related  species  without  any  particular  interest  in  knowing 
the  absolute  evolutionary  time.  This  can  easily  be  done  by  using  genetic 
distance  D,  since  this  is  proportional  to  the  divergence  time  as  long  as  D 
is not very large. In this case no knowledge about c, n, and $\lambda$ is required.


7.2.2 Effects of migration

In  the  early  stage  of  population  differentiation  gene  migration  usually  occurs 
between  populations.  Migration  retards  gene  differentiation  considerably, 
and  even  a small  amount  of  migration  is  sufficient  to  prevent  any  appreciable 
gene  differentiation.  The  effects  of  migration  on  genetic  distance  have  been 
studied  by  Nei  and  Feldman  (1972)  and  Chakraborty  and  Nei  (1974)  under 
the  assumption  of  no  selection.  Their  main  conclusions  are  as  follows: 
1) If there is a constant rate of migration in every generation, the normalized
identity of genes (I) at steady state is given by

$I = (m_1 + m_2)/(m_1 + m_2 + 2v)$    (7.14)

approximately, if $2v \ll m_1 + m_2 \ll 1$. Here, v is the mutation rate per locus
per generation and $m_1$ and $m_2$ stand for the migration rates between two
populations ($m_1$ and $m_2$ may not be the same if the sizes of the two populations
are not equal). 2) The approach to the steady state is generally very
slow;  the  number  of  generations  required  is  of  the  order  of  the  reciprocal 
of  mutation  rate.  Formula  (7.14)  indicates  that  the  genetic  distance  between 
populations  cannot  be  large  unless  migration  rates  are  very  small. 
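
A quick numerical look at (7.14) shows how little migration is needed to hold the distance down. This is a minimal sketch; the function names are hypothetical, and the mutation and migration rates are assumed values chosen only to satisfy the condition stated for the formula.

```python
import math


def steady_state_identity(m1: float, m2: float, v: float) -> float:
    """Normalized identity at steady state under constant migration, formula (7.14)."""
    return (m1 + m2) / (m1 + m2 + 2.0 * v)


def steady_state_distance(m1: float, m2: float, v: float) -> float:
    """Standard genetic distance D = -log_e I at the migration steady state."""
    return -math.log(steady_state_identity(m1, m2, v))


# Assumed mutation rate of 1e-6 per locus per generation (hypothetical value).
v = 1e-6
for m in (1e-3, 1e-4):
    print(m, steady_state_distance(m, m, v))
```

With these assumed rates the steady-state distance stays near 0.001 for a migration rate of one per thousand and near 0.01 for one per ten thousand.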


7.3  Interracial  and  interspecific  gene  differences 

Let  us  now  examine  the  magnitude  of  interracial  and  interspecific  gene 
differences  in  various  organisms  so  far  studied.  Table  7.1  shows  the  minimum, 


Table 7.1

Minimum, standard, and maximum genetic distances (estimates of the number of net
codon differences per locus) between Caucasoid and Negroid* populations in man. These
genetic distances are based on gene frequency data for 62 loci and refer to the codon
differences that are detectable by electrophoresis. From Nei and Roychoudhury (1974b).

            D_C      D_N      D_CN     Genetic distance
Minimum     0.104    0.092    0.108    0.010 ± 0.003
Standard    0.110    0.097    0.114    0.011 ± 0.004
Maximum     0.137    0.115    0.140    0.014 ± 0.006

* A majority of data (42 out of the 62 loci used) were taken from American Negroids.

standard, and maximum estimates of the number of net codon differences
per  locus  between  Caucasoid  and  Negroid  (mostly  American)  populations. 
Dc  and  DN  refer  to  the  estimates  of  codon  differences  between  two  randomly 
chosen  genomes  from  Caucasoid  and  Negroid  populations,  respectively, 
while  Dcn  refers  to  the  same  estimate  between  two  genomes,  one  from 
Caucasoids  and  the  other  from  Negroids.  These  estimates  are  based  on  gene 
frequency  data  for  62  protein  loci.  It  is  seen  that  the  net  codon  differences 
detectable  by  electrophoresis  are  only  about  0.01  per  locus  and  there  is  not 
much  difference  among  the  minimum,  standard,  and  maximum  estimates. 
If  only  one  quarter  of  codon  differences  can  be  detected  by  electrophoresis, 
the  real  number  of  codon  differences  per  locus  is  estimated  to  be  0.04.  On 
the  other  hand,  the  estimates  of  codon  differences  between  two  randomly 
chosen genomes within the same race ($D_C$ and $D_N$) are much larger than the
net codon differences. Namely, the ratio [$R_{ST}$ in (6.13)] of $D$ to $(D_C + D_N)/2$
is only 10 percent. This indicates that the interracial genic variation in man
is  rather  small  compared  with  the  intraracial  variation,  and  the  genes  in 
Caucasoids  and  Negroids  as  well  as  in  Mongoloids  are  remarkably  similar 
(Nei  and  Roychoudhury,  1972).  This  is  in  sharp  contrast  to  the  conspicuous 
phenotypic  differences  observed  in  some  morphological  characters  such  as 
pigmentation  and  facial  structure.  It  is  likely  that  the  genes  controlling  these 


Fig. 7.1. Frequency distributions of single-locus genetic distance between Caucasoids and
Negroids for protein and blood group loci. From Nei and Roychoudhury (1974b).


Table 7.2

Estimates of genetic distance between taxa of various rank.

Taxa                                  No. of   No. of     D = -log_e I       Source
                                      taxa     loci

A. Local races
  Man                                 3        35         0.011 ~ 0.019      Nei and Roychoudhury (1974b)
  Mice (M. musculus)                  4        41         0.010 ~ 0.024      Selander et al. (1969)
  Horseshoe crab (L. polyphemus)      4        25         0.001 ~ 0.013      Selander et al. (1970)
  Kangaroo rats (D. ordii)            9        18         0.000 ~ 0.058      Johnson and Selander (1971)
  Lizards (A. carolinensis)           3        23         0.001 ~ 0.017      Webster et al. (1972)
  Astyanax mexicanus, surface fish    6        17         0.002 ~ 0.013      Avise and Selander (1972)
  Drosophila
    pseudoobscura                     3        24         0.003 ~ 0.010      Prakash et al. (1969)
    willistoni                        9        11         0.001 ~ 0.008      Ayala et al. (1972)

B. Subspecies
  Mice                                2        41         0.194              Selander et al. (1969)
  Pocket gophers (T. talpoides)*      10       31         0.004 ~ 0.262      Nevo et al. (1974)
  Gophers (T. bottae)                 4        27         0.009 ~ 0.054      Patton et al. (1972)
  Lizards (A. carolinensis),          4        23         0.335 ~ 0.351      Webster et al. (1972)
    U.S. mainland vs. Bimini Island
  Newts (T. torosa)                   2        18         0.164              Hedgecock and Ayala (1974)
  Astyanax mexicanus**,               9        17         0.062 ~ 0.218      Avise and Selander (1972)
    cave vs. surface fish
  Drosophila
    paulistorum                       4        12         0.028 ~ 0.234      Richmond (1972a)
    willistoni                        2        25         0.201              Ayala and Tracey (1973)
    pseudoobscura,                    5        24         0.083 ~ 0.126      Prakash et al. (1969)
      Bogota vs. U.S. population

C. Species
  Kangaroo rats                       2        18         0.49               Johnson and Selander (1971)
  Gophers                             2        27         0.12               Patton et al. (1972)
  Bats†                               3        14         0.51 ~ 0.63        Shaw (1970)
  Lizards (Anolis)                    4        23         1.32 ~ 1.75        Webster et al. (1972)
  Amphisbaenian (Bipes)               3        22         0.61 ~ 1.01        Kim et al. (1975)
  Newts                               3        18         0.27 ~ 0.57        Hedgecock and Ayala (1974)
  Teleosts                            3        24         0.36 ~ 0.52        Siciliano et al. (1973)
  Drosophila
    Sibling species                   18       13 ~ 23    0.18 ~ 1.54        Hubby and Throckmorton (1968)
                                      3        28         0.61 ± 0.071       Ayala and Tracey (1974)
    pseudoobscura vs. persimilis      2        24         0.05               Prakash (1969)
    Nonsibling species                27       13 ~ 23    1.3 ~ 2.54         Hubby and Throckmorton (1968)
                                      10       27         0.66 ~ 1.91        Lakovaara et al. (1972a)
                                      4        28         1.12 ± 0.14        Ayala and Tracey (1974)
  Myxomycetes†                        3        22         1.51 ~ 2.73        Shaw (1970)
  Bacteria†                           8        8          0.29 ~ 2.08        Shaw (1970)

D. Genera
  Fish (Sciaenidae)†                  5        16         1.1 ~ 2.8 (∞)      Shaw (1970)

E. Families
  Man-Chimpanzee                      2        42         0.62               King and Wilson (1975)

F. Orders
  Man-Horse                           2        -          [illegible]††      Nei (1973a)

* The populations studied have different chromosome numbers, so that they are classified
as distinct subspecies.

** One of the three cave populations studied apparently receives a small amount of gene
migration from surface populations.

† Only a few individuals or strains from each species were studied, so that the reliability
of the results is low. One of the twelve pairs of genera studied in fish shared no common
proteins. So, D = ∞, though this is surely due to the small numbers of loci and individuals
studied.

†† This estimate was obtained from amino acid sequence data (see text).

morphological characters were subjected to stronger natural selection than
'average  genes'  in  the  process  of  racial  differentiation.  Note  that  the  number 
of  loci  controlling  the  difference  in  pigmentation  between  Caucasoids  and 
Negroids  has  been  estimated  to  be  about  3 to  4 (Stern,  1970). 
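
The entries of table 7.1 can be checked against the definitions of sect. 7.1. The sketch below is illustrative: the function names are hypothetical, and the only inputs are the three minimum estimates printed in the table. It relies on the definitions $D_{X(m)} = 1 - J_X$ and $D_X = -\log_e J_X$, so each standard estimate follows from the corresponding minimum estimate.

```python
import math


def net_distance(d_x: float, d_y: float, d_xy: float) -> float:
    """Net codon differences between populations: D_XY - (D_X + D_Y) / 2."""
    return d_xy - (d_x + d_y) / 2.0


def to_standard(d_min: float) -> float:
    """Convert a minimum estimate 1 - J into the standard estimate -log_e J."""
    return -math.log(1.0 - d_min)


# Minimum estimates from table 7.1.
D_C, D_N, D_CN = 0.104, 0.092, 0.108

minimum = net_distance(D_C, D_N, D_CN)
standard = net_distance(to_standard(D_C), to_standard(D_N), to_standard(D_CN))
# Ratio of the net distance to the average within-race value.
ratio = standard / ((to_standard(D_C) + to_standard(D_N)) / 2.0)

print(round(minimum, 3), round(standard, 3), round(ratio, 2))
```

The computed net distances agree with the tabulated 0.010 and 0.011, and the ratio comes out close to the 10 percent quoted in the text.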

Nei  and  Roychoudhury  (1974b)  also  studied  the  genetic  distance  for  blood 
group  loci  among  the  three  major  races  of  man.  In  this  case  the  loci  used 
did  not  appear  to  be  a random  sample  of  the  genome  but  the  results  obtained 
were  very  similar  to  those  for  protein  loci. 

Although  the  average  genetic  distance  or  the  number  of  net  codon 
differences  per  locus  among  the  major  races  of  man  was  small,  there  was  a 
considerable  variation  in  single-locus  genetic  distance  among  loci  (fig.  7.1). 
In  a majority  of  loci  the  single-locus  distance  was  0.  That  is,  the  same  allele 
was  fixed  in  two  or  all  of  the  three  races.  On  the  other  hand,  there  were 
a few loci at which the distance was as high as 15 percent. In none of the loci
studied  were  different  alleles  fixed  in  different  races. 

With  the  help  of  Dr.  Arun  Roychoudhury,  I also  computed  the  interracial 
and  interspecific  genetic  distances  (standard  only)  in  other  organisms  from 
published  data.  The  results  obtained  are  presented  in  table  7.2.  Some  of 
the  estimates  in  this  table  were  directly  quoted  from  the  original  papers. 
The  genetic  distance  estimates  are  classified  into  five  categories  according 
to  the  rank  of  the  taxa  compared,  i.e.,  local  races,  subspecies,  species, 
genera,  and  families.  (The  genetic  distance  between  man  and  horse  was 
estimated  from  amino  acid  sequence  data.)  The  distinction  between  local 
races and subspecies was not always easy. I generally followed the classification
by the authors who published gene frequency data, but when there
is  evidence  that  no  or  little  migration  occurs  between  a given  pair  of  taxa,  I 
classified  them  as  subspecies. 

The  genetic  distance  between  races  is  generally  very  small  and  always 
less  than  a few  percent.  The  largest  value  (0.058)  was  obtained  between 
Arizona  and  Texas  populations  in  kangaroo  rats.  This  organism,  however, 
apparently  has  a short  migration  distance  and  the  two  populations  may  be 
reproductively  isolated.  It  is  noted  that  the  average  gene  diversity  within 
populations  in  this  organism  is  only  0.008  per  locus  (Johnson  and  Selander, 
1971).  In  most  other  cases  the  distance  was  less  than  0.02.  This  result  is  in 
agreement  with  our  earlier  theoretical  conclusion  that  genetic  distance 
cannot  be  very  large  as  long  as  there  is  migration.  Also,  it  is  noted  that  the 
genetic  distances  among  major  races  of  man  are  of  the  same  order  of 
magnitude  as  those  of  local  races  in  other  organisms. 

Estimates of genetic distance between subspecies are generally much larger
than those between races. The values obtained between the U.S. mainland
(Florida, Louisiana, and Texas) and the Bimini Island (in the Bahamas)
populations of Anolis carolinensis (lizards) were as high as 0.34. This is
about  30  times  larger  than  the  genetic  distance  between  Caucasoids  and 
Negroids  in  man.  On  the  other  hand,  the  genetic  distance  between  the  A and 
I subspecies  of  Drosophila  paulistorum  in  Tapuruquara,  Brazil,  is  only  0.03. 

Table 7.3

Estimates of genetic distance (D) between sibling and nonsibling species, and relative
divergence time (T) of nonsibling to sibling species in nine triads of Drosophila species.
In each triad of species (a) and (b) are sibling species, while (a) and (c) or (b) and (c) are
nonsibling species. The data analyzed are those of Hubby and Throckmorton (1968).
From Nei (1971a).

Triad  Species               No. of     D ± SE for           D ± SE for           Relative
                             proteins   sibling species      nonsibling species   divergence
                             examined                                             time (T)

1      a) arizonensis        19.3       0.76 ± 0.24          2.26 ± 0.67          3.0
       b) mojavensis
       c) mulleri
2      a) mercatorum         19.3       0.40 ± 0.16          1.58 ± 0.45          4.0
       b) paranaensis
       c) peninsularis
3      a) hydei              16.7       0.74 ± 0.26          2.41 ± 0.7[?]        3.3
       b) neohydei
       c) eohydei
4      a) fulvimacula        20.3       0.45 ± 0.17          1.31 ± 0.36          2.9
       b) fulvimaculoides
       c) limensis
5      a) melanica           21.0       1.25 ± 0.35          1.95 ± 0.53          1.6
       b) paramelanica
       c) nigromelanica
6      a) melanogaster       19.0       0.75 ± 0.24          2.54 ± 0.75          3.4
       b) simulans
       c) takahashii
7      a) saltans            20.3       0.81 ± 0.25          1.76 ± 0.49          2.2
       b) prosaltans
       c) emarginata
8      a) willistoni         14.0       1.54 ± [illegible]   1.39 ± 0.46          0.9
       b) paulistorum
       c) nebulosa
9      a) victoria           14.3       0.18 ± [illegible]   1.[?]6 ± 0.51        9.0
       b) lebanonensis
       c) pattersoni

It is worthwhile to note that the genetic distance between the Bogota (Colombia)
and United States populations of D. pseudoobscura is about 0.11, though
they  are  generally  classified  as  local  races.  Interestingly,  however,  Prakash 
(1972)  recently  discovered  that  F1  males  obtained  from  the  cross  of  Bogota 
females  x U.S.  males  are  sterile.  Clearly,  they  are  now  in  the  process  of 
speciation. 

Genetic  distance  between  different  species  is  generally  still  larger  than  that 
between  different  subspecies.  In  some  extreme  cases  it  is  as  large  as  2.7, 
about  ten  times  larger  than  intersubspecific  distances.  If  we  take  into  account 
the  possibility  that  codon  differences  are  grossly  underestimated  when  D 
is  larger  than  1,  the  actual  interspecific  gene  differences  must  be  much 
larger  than  intersubspecific  differences.  Nevertheless,  there  is  considerable 
variation  in  the  estimate  of  D and  in  some  cases  it  is  as  small  as  or  even 
smaller  than  some  intersubspecific  genetic  distances.  This  variation  is  of 
course  expected  since  the  definition  of  species  largely  depends  on  reproductive 
isolation  and  morphological  differences.  Theoretically,  reproductive  isolation 
can  be  attained  by  only  a few  gene  substitutions,  as  will  be  discussed  later. 

Some species in animals are morphologically very similar but reproductively
isolated. They are usually called sibling species and are quite
common  in  invertebrates.  The  genetic  differences  between  these  sibling 
species  compared  with  those  between  nonsibling  species  have  been  a subject 
of  speculation  for  a long  time.  Arguing  that  for  a new  species  to  be  established 
a 'major  genetic  reorganization'  is  required,  Mayr  (1963)  postulated  that 
'sibling  species  show  the  same  degree  of  genetic  differences  as  do  other 
closely  related  nonsibling  species'.  Hubby  and  Throckmorton  (1968)  studied 
this  problem  by  examining  the  protein  differences  between  sibling  species 
and  between  nonsibling  species  in  Drosophila.  The  results  obtained  are  given 
in  table  7.3  in  terms  of  genetic  distance  reanalyzed  by  Nei  (1971a).  In  this 
case  only  a small  number  of  inbred  flies  from  each  species  were  examined. 
Also,  electrophoretic  mobility  of  proteins  was  compared  without  conducting 
genetic analysis. Therefore, the D values in table 7.3 are probably overestimated.
If we neglect the second factor, the probable maximum amount
of overestimation is about 0.12, which is equal to the estimate of intraspecific
heterozygosity in Drosophila. At any rate, it is clear from the table
that  genetic  distances  between  nonsibling  species  are  considerably  larger 
than  those  between  sibling  species,  though  sampling  error  is  very  large. 
This  is  contrary  to  Mayr's  postulate  but  confirms  and  reinforces  Hubby  and 
Throckmorton's  conclusion  that  sibling  species  are  genetically  more  similar 
than nonsibling species.

In this connection one might wonder how many gene substitutions are
required  for  a new  species  to  be  formed  from  a local  population.  Haldane's 
(1957a)  guess  of  this  number  was  1000.  But  this  cannot  be  answered  by 
examining  the  interspecific  gene  differences,  since  some  gene  substitutions 
may  not  have  been  required  but  just  happened.  We  can,  however,  answer 
the  following  question:  how  many  gene  substitutions  generally  occur  when 
a new species is formed? The answer to this question can be obtained by
examining the minimum number of gene differences between species. In
table 7.2 the smallest interspecific genetic distance is that between D. pseudoobscura
and D. persimilis and it is only 0.05. The next smallest value is
between  D.  victoria  and  D.  lebanonensis  (table  7.3).  As  noted  earlier,  this 
value  is  apparently  overestimated  because  the  intraspecific  polymorphism 
has  been  neglected.  If  we  make  a correction,  it  becomes  0.18  - 0.12  = 0.06 
roughly.  Therefore,  if  electrophoresis  detects  only  a quarter  of  codon 
differences,  the  actual  number  of  codon  differences  is  estimated  to  be  about 
0.2  per  locus,  neglecting  synonymous  codons.  If  a Drosophila  genome  has 
5000  structural  genes,  this  is  equivalent  to  1000  codon  differences  per 
genome.  If  both  species  compared  experienced  an  equal  number  of  gene 
substitutions  during  speciation,  about  500  gene  (codon)  substitutions  must 
have  occurred  in  each  species.  Interestingly,  this  is  not  far  from  Haldane's 
guess. 
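
The arithmetic behind this estimate can be laid out step by step. The sketch below simply restates the figures used in the paragraph above; the function name is hypothetical, and the detectable fraction (one quarter) and the gene number (5000) are the assumptions made in the text, not measured values.

```python
def substitutions_at_speciation(d_detectable: float, c: float, n_genes: int):
    """Scale a detectable distance per locus up to codon differences per genome.

    Returns (differences per locus, differences per genome, substitutions per
    species, assuming both species changed equally).
    """
    per_locus = d_detectable / c      # correct for undetected codon differences
    per_genome = per_locus * n_genes  # scale from one locus to all structural genes
    per_species = per_genome / 2.0    # split evenly between the two lineages
    return per_locus, per_genome, per_species


# Smallest interspecific distance in table 7.2, with the assumptions stated in the text.
per_locus, per_genome, per_species = substitutions_at_speciation(0.05, 0.25, 5000)
print(per_locus, per_genome, per_species)
```

Starting from the corrected value of 0.06 instead of 0.05 raises each figure by one fifth, so the conclusion that the number is of the order of several hundred per species does not depend on which of the two smallest distances is used.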

Gene  differences  between  different  genera  have  been  studied  only  in  a 
few  organisms  (Shaw,  1970).  The  data  in  the  family  Sciaenidae  in  fish 
indicate  that  intergeneric  genetic  distance  is  still  larger  than  interspecific 
distance  (table  7.2).  In  all  cases  examined  the  D value  was  larger  than  1. 
In  one  of  the  twelve  intergeneric  comparisons  studied  no  common  proteins 
were shared by the two genera, so that D turned out to be infinite. This, of course,
may  be  due  to  sampling  error,  since  the  number  of  loci  studied  is  only  16. 
Shaw also studied the protein identities among six different genera in a
family of bacteria, the Enterobacteriaceae. Curiously, the genetic distances
between species of three genera, Escherichia, Shigella, and Salmonella, were
of the same order of magnitude as interspecific genetic distance. This is,
however, understandable, since bacterial taxonomists have long suspected
that they might be subspecies (Shaw, 1970). On the other hand, none of the
eight proteins studied was shared between Shigella flexneri, Salmonella
typhimurium, and S. typhi, on one hand, and Klebsiella pneumoniae, Serratia
marcescens, and Proteus vulgaris, on the other. There were one or two common
proteins among the latter group of three species. Thus, the intergeneric genetic
distance is apparently quite large, as expected, though a more extensive and
careful study should be made.

Recently,  King  and  Wilson  (1975)  studied  the  electrophoretically  detect- 
able protein  differences  between  man  and  chimpanzee.  These  two  organisms 
belong  to  different  families,  but  surprisingly  the  genetic  distance  was  only 
0.62,  which  corresponds  to  the  interspecific  genetic  distance  in  other  organ- 
isms. This  dilemma  may  be  resolved  by  one  of  three  possible  explanations. 
First,  primates  have  been  considerably  oversplit  relative  to  other  groups  as  a 
simple  result  of  anthropocentrism.  Second,  morphological  differences 
between  species  in  other  taxa  are  not  as  easily  distinguishable  as  differences 
between  primates.  Third,  for  a given  amount  of  change  at  the  gene  level  there 
has  been  more  morphological  and  behavioral  change  between  man  and 
chimpanzee  than  between  species  in  other  organisms.  Arguing  that  the  actual 
morphological  differences  between  man  and  chimpanzee  are  much  larger 
than  those  between  species  of  house  mouse,  lizards,  and  Drosophila,  King 
and  Wilson  prefer  the  third  explanation. 

As noted earlier, the estimate of D is not reliable when I is close to 0,
unless a large number of proteins are studied. However, if amino acid
sequence data are available and $2\lambda t$ is obtainable, D can be estimated for
any pair of organisms by using the relation $D = 2cn\lambda t$. As an example, let
us consider the genetic distance between man and horse. We use amino
acid sequence data for the β-chain of hemoglobin, since the rate of amino
acid substitution for this polypeptide is close to the average rate for many
proteins. It is known that the number of amino acid differences between
human and horse β-chains is 25. Since a β-chain consists of 146 amino acids,
$2\lambda t$ can be estimated by $-\log_e(1 - 25/146)$, which becomes 0.188.
Multiplying this number by n = 146, we get $2n\lambda t = 27.4$ for the β-chain. However,
the hemoglobin β-chain is a relatively small polypeptide. The 'average
polypeptide' appears to consist of some 400 amino acids. Thus, the genetic
distance between man and horse when c = 1 would be roughly 75 codon
differences per locus. To compare this with the values of D obtained from
electrophoretic studies, it must be multiplied by $c \approx 1/4$. Then, we have D =
19  approximately.  Therefore,  the  gene  differences  between  man  and  horse 
are  about  40  times  larger  than  those  between  man  and  chimpanzee  and  about 
200  times  larger  than  those  between  Caucasoids  and  Negroids  in  man.  Of 
course,  these  estimates  are  very  rough,  and  to  get  more  reliable  estimates, 
we  must  use  amino  acid  sequence  data  for  many  proteins. 
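
The man-horse calculation above can be reproduced as a short Python sketch. The function name and its parameters are hypothetical and introduced only for illustration; the 400-residue 'average polypeptide' and the detection fraction c = 1/4 are the text's own rough assumptions.

```python
import math

def codon_differences_per_locus(n_diff, n_sites, avg_length=400):
    """Estimated codon differences per average locus (c = 1).

    n_diff / n_sites is the observed proportion of differing amino acids;
    -log(1 - p) corrects it for multiple substitutions at the same site.
    """
    two_lambda_t = -math.log(1 - n_diff / n_sites)
    return two_lambda_t * avg_length

# Hemoglobin beta-chain, man vs. horse: 25 differences in 146 residues.
two_lambda_t = -math.log(1 - 25 / 146)
per_beta_chain = two_lambda_t * 146                    # differences per beta-chain
per_avg_locus = codon_differences_per_locus(25, 146)   # scaled to 400 residues
d_electrophoretic_scale = per_avg_locus * 0.25         # assumed c = 1/4

print(round(two_lambda_t, 3), round(per_beta_chain, 1))
print(round(per_avg_locus), round(d_electrophoretic_scale))
```

Because the estimate rests on a single protein, the figures should be read as orders of magnitude rather than precise values.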

In  the  future  the  technology  of  amino  acid  sequencing  will  be  advanced 
and  this  will  make  it  possible  to  study  the  genic  variation  within  and  between 
populations at the codon level directly. Then, we will be able to estimate
genetic distance more accurately, since c can be equated to 1. Also, if enough
data  are  available,  we  will  be  able  to  compute  genetic  distance  between  any 
pair  of  organisms  or  taxa,  so  that  all  organisms  may  be  compared  by  means 
of  the  same  scale,  i.e.,  the  average  number  of  codon  differences  per  locus. 

One  might  wonder  whether  genetic  distance  is  useful  for  defining  a 
species.  In  higher  organisms  the  definition  of  species  depends  on  morpho- 
logical differences  as  well  as  on  reproductive  isolation.  If  two  groups  of 
organisms  are  reproductively  isolated,  they  are  defined  as  distinct  species 
even  if  they  are  morphologically  very  similar.  (Of  course,  we  exclude  asexual 
organisms  in  this  case.)  Since  reproductive  isolation  can  be  attained  by  a 
relatively  small  number  of  gene  substitutions,  genetic  distance  may  vary 
considerably  among  different  pairs  of  species,  as  we  have  seen.  Therefore, 
species  cannot  be  defined  in  terms  of  genetic  distance  alone.  Nevertheless, 
it  is  a measure  of  evolutionary  relationships  between  species,  so  that  it  will 
be  an  important  taxonomic  criterion  in  the  future.  Particularly,  in  those 
groups  of  bacteria  and  fungi  in  which  no  sexual  reproduction  is  observed, 
it  may  solve  many  taxonomic  problems.  Stout  and  Shaw  (1974)  recently 
showed  that  the  proportion  of  common  proteins  shared  by  several  strains  of 
Mucor  racemosus  showing  similar  morphological  characters  is  less  than  10 
percent.  They  suggest  that  these  strains  should  represent  distinct  species. 


7.4  Phylogeny  of  closely  related  organisms 

One  of  the  important  tasks  in  evolutionary  studies  is  to  clarify  the  phylo- 
genetic relationship  among  different  organisms.  If  we  know  this  relationship 
together  with  the  evolutionary  time,  we  will  be  able  to  understand  what 
kinds  of  genetic  changes  were  important  in  creating  a new  species  or  a new 
group of organisms. We will also be able to estimate the rate at which a certain
morphological  or  physiological  character  has  evolved.  Thanks  to  the  great 
efforts  of  biologists  in  the  19th  and  early  20th  centuries,  we  know  the  major 
aspects  of  phylogeny  in  animals  and  plants.  This  knowledge  has  been  very 
important  in  the  subsequent  studies  of  evolutionary  mechanisms.  Our  recent 
estimates  of  the  rate  of  amino  acid  substitution  in  proteins  or  nucleotide 
substitutions  in  DNA  could  not  have  been  obtained  without  this  knowledge. 

Yet,  our  knowledge  about  the  phylogeny  of  animals  and  plants  is  far 
from  complete.  In  fact,  we  know  virtually  nothing  about  the  phylogenetic 
relationships  among  closely  related  taxa  except  in  some  special  organisms. 
This is because in a majority of organisms the fossil record at the species
level  is  nonexistent.  The  phylogenetic  relationship  can  be  inferred  to  some 
extent  by  studying  the  morphological  affinity.  Strictly  speaking,  however, 
the  morphological  affinity  of  taxa  does  not  necessarily  represent  the  real 
phylogeny.  Thus,  Sokal  and  Sneath  (1963)  stressed  the  separation  of  the 
so-called  phenetic  (similarity)  and  phyletic  (phylogeny)  relationships. 
Numerical  taxonomy  applied  to  morphological  characters  always  gives  only 
the  phenetic  relation  of  taxa. 

In  the  past,  of  course,  there  have  been  some  successful  attempts  to  clarify 
the  phylogenetic  relationship  among  closely  related  organisms  where  fossil 
records  are  missing.  Particularly  important  is  the  study  of  chromosomal 
relationships  among  related  taxa.  Since  chromosomal  changes  in  the 
evolutionary  process  are  generally  unique  and  very  slow,  it  is  often  possible 
to  trace  the  evolutionary  scheme  of  a group  of  species  or  genera.  A most 
beautiful  example  is  Cleland's  (1972)  study  on  the  evolution  of  the  North 
American  evening  primrose,  Oenothera.  Examining  the  patterns  of  chromo- 
somal translocations  in  the  genomes  of  each  of  the  six  species  (Oe.  strigosa, 
Oe. biennis, Oe. grandiflora, Oe. parviflora, Oe. hookeri, and Oe. argillicola),
he  clarified  the  phylogeny  of  these  species.  Nevertheless,  this  technique 
cannot  be  used  universally,  since  few  chromosomal  changes  have  occurred 
in  some  organisms.  Also,  it  cannot  provide  any  quantitative  estimate  of 
relative  or  absolute  evolutionary  time. 

However,  we  are  now  in  a position  to  make  a more  reliable  and  quantita- 
tive phylogenetic  tree.  At  the  codon  level,  gene  substitution  in  evolution  is  a 
slow  process  and  seems  to  proceed  roughly  at  a constant  rate  per  unit 
chronological  time.  The  probability  of  back  mutations  or  parallel  mutations 
at  a codon  is  negligibly  small  unless  evolutionary  time  is  very  large.  Thus, 
the  phylogeny  of  a group  of  taxa  can  be  studied  by  using  genetic  distance. 
This  method  has  a great  advantage  over  the  conventional  method  of  com- 
parative morphology,  in  which  convergence  and  divergence  in  morphological 
changes  always  make  the  results  uncertain  (see  Sokal  and  Sneath,  1963). 

7.4.1 Evolutionary time

In section 7.2 we have indicated that a rough divergence time between a pair
of isolated taxa can be estimated from electrophoretic data by $t = D/(2\alpha)$
as long as D is small, say less than 1. If D is large, the above method is
expected to give an underestimate. It also gives an underestimate if $\alpha$ varies
among loci. Some corrections for these factors can be made under certain
circumstances (Nei, 1971a; Nei and Chakraborty, 1973). In ch. 3 we estimated
$\alpha$ to be roughly $10^{-7}$ per year for electrophoretically detectable proteins.
Therefore, a crude estimate of divergence time can be obtained by

$$t = 5 \times 10^{6} D. \qquad (7.15)$$

It should be emphasized that our estimates of $\alpha$ depend on a number of
assumptions about the biochemical properties of proteins. In my 1971 paper
I used $\alpha = 6.8 \times 10^{-7}$ in analyzing Hubby and Throckmorton's (1965,
1968) data on protein identity. This is because these authors used each
electrophoretic band as a unit of comparison rather than each polypeptide
without conducting any genetic analysis. For the current genetic data,
however, $\alpha = 10^{-7}$ seems to be better, though this is also subject to a large
standard error. It should also be noted that $\alpha$ varies considerably with
protein. So, the mean value of $\alpha$ also should vary according to the proteins
used. In fact, M. King (1973) estimates that the $\alpha$ value is about ten times
smaller for intracellular proteins than for extracellular proteins. It is hoped
that in the future a more reliable estimate of $\alpha$ will be obtained. If $\alpha$ changes
in the future, the estimates of divergence time in this section will also change.

Nevertheless,  it  is  important  to  get  a rough  idea  of  the  divergence  time 
between  a particular  pair  of  taxa,  since  we  can  then  study  other  problems 
such  as  morphological  changes  and  reproductive  isolation  more  quantita- 
tively. It  should  be  noted  that  the  exact  divergence  time  will  never  be  known 
in  practice.  This  is  because,  in  order  to  know  this  time,  all  information  about 
the  process  of  speciation  and  natural  selection  is  required.  In  many  organisms 
fossil  records  are  not  available,  particularly  for  the  evolution  of  closely 
related  species.  Furthermore,  even  if  they  are  available,  they  provide  only 
rough  estimates  of  divergence  time,  since  morphological  changes  observed 
in  fossils  should  have  occurred  much  later  than  the  actual  isolation  (re- 
productive or  geographical)  of  the  taxa  in  question. 

At  any  rate,  if  we  use  formula  (7.15),  we  can  estimate  rough  evolutionary 
times  for  subspecies  and  species.  Interracial  divergence  time  is  also  estimable, 
if  the  two  races  in  question  have  been  reproductively  isolated  during  the 
gene  differentiation.  In  many  cases,  however,  this  is  not  always  clear.  The 
three  major  races  of  man,  Caucasoids,  Negroids,  and  Mongoloids  are 
roughly  distinguishable  in  terms  of  such  characters  as  pigmentation,  facial 
structure,  and  hair  texture.  This  suggests  that  the  main  groups  of  these 
races  have  been  isolated  geographically  for  a considerable  period  of  time, 
though  some  degree  of  gene  mixture  must  have  occurred.  Using  35  protein 
loci common to the three races, Nei and Roychoudhury (1974b) estimated
the genetic distances and divergence times as follows:

                              D        t (years)
Caucasoid vs. Negroid       0.023      115,000
Caucasoid vs. Mongoloid     0.011       55,000
Negroid vs. Mongoloid       0.024      120,000

Here  Negroid  refers  to  African  Negroids  rather  than  American  Negroids. 
Since  in  an  early  stage  of  population  differentiation  some  migration  must 
have occurred, these estimates of divergence time may be minimal. Therefore,
the three major races appear to have been isolated for at least 50 to 100
thousand years. These estimates are not inconsistent with the present fossil
records  about  early  man.  They  are  also  of  the  same  order  of  magnitude  as  the 
estimate  (25,000  ~ 100,000  years)  obtained  by  Cavalli-Sforza  (1969)  using 
an  entirely  different  method. 
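
As a check on the figures tabulated above, formula (7.15) can be applied directly. The following is a minimal Python sketch; the function name and the constant `ALPHA` are illustrative, and the rate of $10^{-7}$ per locus per year is the text's rough assumption, itself subject to a large standard error.

```python
ALPHA = 1e-7  # assumed substitution rate per locus per year (the text's rough value)

def divergence_time(d, alpha=ALPHA):
    """Divergence time in years from genetic distance D, using t = D / (2 * alpha)."""
    return d / (2 * alpha)

# Genetic distances quoted from Nei and Roychoudhury (1974b).
distances = {
    "Caucasoid vs. Negroid": 0.023,
    "Caucasoid vs. Mongoloid": 0.011,
    "Negroid vs. Mongoloid": 0.024,
}

times = {pair: round(divergence_time(d)) for pair, d in distances.items()}
for pair, t in times.items():
    print(f"{pair}: {t:,} years")
```

Passing a different `alpha` shows how directly the time estimates depend on the assumed rate: doubling it halves every divergence time.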

In  this  connection  it  is  interesting  to  estimate  the  maximum  possible 
migration  rate  which  might  have  occurred  among  the  three  major  races. 
This  can  be  obtained  by  assuming  that  the  genetic  distances  among  them 
have reached the steady state value. Namely, the maximum possible migration
rate between two races [$m = (m_1 + m_2)/2$] can be estimated from
$I = \exp(-D) = m/(m + v)$ in (7.14). If we assume $v = 2 \times 10^{-6}$ per generation,
then m is $1 \times 10^{-4}$ per generation between Caucasoids and Negroids and
$2 \times 10^{-4}$ between Caucasoids and Mongoloids. This suggests that the rate
of migration between the three major races, if any, was very small.
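
Solving the steady-state relation I = exp(-D) = m/(m + v) for m gives m = vI/(1 - I). The sketch below evaluates this in Python; the function name is hypothetical, and the mutation rate is the value assumed in the text. The results agree with the rounded figures quoted above.

```python
import math

def max_migration_rate(d, v=2e-6):
    """Upper bound on the migration rate m per generation.

    Assumes the observed distance D is the steady-state value, so that
    I = exp(-D) = m / (m + v), which rearranges to m = v * I / (1 - I).
    """
    identity = math.exp(-d)
    return v * identity / (1 - identity)

m_caucasoid_negroid = max_migration_rate(0.023)
m_caucasoid_mongoloid = max_migration_rate(0.011)
print(f"{m_caucasoid_negroid:.1e}", f"{m_caucasoid_mongoloid:.1e}")
```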

It  is  not  clear  how  the  interracial  genetic  distances  in  other  organisms  in 
table  7.2  are  related  to  evolutionary  time,  since  little  is  known  about  the 
migration  among  races.  In  the  case  of  pocket  gophers  the  large  value  of 
D = 0.06  is  probably  due  to  isolation,  as  mentioned  earlier.  If  so,  this 
corresponds  to  an  isolation  of  about  300  thousand  years. 

On  the  other  hand,  many  subspecies  seem  to  have  been  isolated  for  a 
long  period  of  time  - about  150  thousand  to  1.5  million  years,  though  the 
standard  error  is  very  large.  The  divergence  time  for  species  seems  to  be  still 
larger  in  general.  The  average  seems  to  be  nearly  five  million  years.  However, 
the  variation  among  species  is  very  large.  The  divergence  time  between 
D.  pseudoobscura  and  D.  persimilis  is  estimated  to  be  about  250,000  years, 
while  in  some  organisms  such  as  lizards  in  the  Bimini  Island  and  some  non- 
sibling Drosophila  species  the  divergence  time  seems  to  be  at  least  about 
10  million  years.  From  the  studies  on  fossil  records  from  various  organisms, 
mostly  vertebrates,  Rensch  (1960)  concluded  that  the  average  age  of  recent 
species is somewhere between 100,000 and a few million years. Our estimates
seem to be consistent with Rensch's conclusion. In the case of gophers
(Thomomys talpoides) Nevo et al. (1974) showed that the estimates of
evolutionary  times  from  protein  data  agree  fairly  well  with  the  fossil  records 
available.  The  average  evolutionary  time  for  genera  seems  to  be  much  longer 
than  that  for  species,  but  our  method  apparently  does  not  provide  reliable 
estimates,  since  the  standard  error  of  D is  very  large  when  D is  large  or  I is 
small. 

Some  special  comments  should  be  made  about  the  divergence  time  between 
man  and  chimpanzee.  If  we  use  King  and  Wilson's  (1975)  estimate  of  genetic 
distance  (D  = 0.62),  the  divergence  time  becomes  3.1  million  years.  This  is 
smaller  than  any  estimate  so  far  obtained  and  almost  certainly  erroneous. 
We  note,  however,  that  this  estimate  is  subject  to  a large  standard  error. 
M.  King  (1973)  has  analyzed  her  data  differently.  According  to  her,  the  rate 
of  amino  acid  substitutions  per  locus  that  are  detectable  by  electrophoresis 
is  different  between  intracellular  and  extracellular  proteins.  Her  estimate 
is $2.9 \times 10^{-8}$ for the former proteins and $1.9 \times 10^{-7}$ for the latter. On the
other hand, the electrophoretic identity of proteins (I) is 0.71 for the former
and 0.14 for the latter. Therefore, the divergence time is estimated to be
$-\log_e 0.71/(5.8 \times 10^{-8}) = 5.9 \times 10^{6}$ years from intracellular proteins and
$5.2 \times 10^{6}$ years from extracellular proteins. These estimates are in good
agreement  with  Sarich  and  Wilson's  (1967)  estimate  of  4 ~ 5 million  years 
from  immunological  studies  of  albumin.  We  shall  discuss  this  problem  again 
in  the  next  chapter. 
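
The two estimates derived from M. King's data can be recomputed as follows. This Python sketch takes the rates to be $2.9 \times 10^{-8}$ and $1.9 \times 10^{-7}$ per locus per year, which is how the figures in the text are read here; the function name is illustrative.

```python
import math

def divergence_time_from_identity(identity, rate_per_lineage):
    """Years since divergence, t = -log(I) / (2 * rate).

    The factor 2 reflects substitutions accumulating in both lineages.
    """
    return -math.log(identity) / (2 * rate_per_lineage)

t_intracellular = divergence_time_from_identity(0.71, 2.9e-8)
t_extracellular = divergence_time_from_identity(0.14, 1.9e-7)
print(f"{t_intracellular:.2e} years", f"{t_extracellular:.2e} years")
```

That two classes of protein with very different rates and identities give nearly the same time is what makes these estimates more credible than the single pooled value of 3.1 million years.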

Our  theory  of  estimation  of  divergence  time  between  two  populations  is 
based on the assumption that the effective size is the same for the two
populations.  In  practice,  our  formula  is  quite  robust  and  seems  to  be 
approximately  applicable  even  if  one  population  is  ten  times  smaller  or 
larger  than  the  other.  In  nature,  however,  a group  of  individuals  is  occa- 
sionally split  from  a population  and  occupies  a new  territory  to  undergo 
an  independent  evolution,  while  the  original  population  stays  in  the  same 
old  territory.  In  such  a case  the  size  of  the  new  population  may  be  drastically 
different  from  that  of  the  original  population.  Formula  (7.7)  then  does  not 
hold. However, it can be shown that if we redefine I as

$$I = J_{XY}/J_X, \qquad (7.16)$$

where X and Y refer to the original and the descendant populations,
respectively, then (7.7) still holds (Chakraborty and Nei, 1974). Therefore, the
divergence time can be estimated by (7.13).




Table 7.4

Probability of identity of genes within and between two cave and two surface populations
of Astyanax mexicanus. The data used are those of Avise and Selander (1972). From
Chakraborty and Nei (1974).

                   Cave populations           Surface populations
                   Pachon     Los Sabinos     Arroyo B     Arroyo Valles

Pachon             1.0000     0.7976          0.7788       0.7541
Los Sabinos                   0.9640          0.8043       0.7808
Arroyo B                                      0.8978       0.8781
Arroyo Valles                                              0.8668

Evolution  in  the  cave  fish  Astyanax  mexicanus  serves  as  an  interesting 
example  in  this  case.  Avise  and  Selander  (1972)  studied  the  gene  frequencies 
for  17  protein  loci  in  three  cave  and  six  river  populations  of  the  characid 
fish  Astyanax  mexicanus  in  Mexico.  One  of  the  cave  populations  studied, 
i.e., Pachon, appears to be almost entirely isolated from the river populations,
and  the  fish  in  this  cave  are  uniformly  eyeless  and  unpigmented.  The  fish 
in  another  cave,  Los  Sabinos,  are  also  uniformly  eyeless  and  unpigmented, 
but  there  is  a possibility  that  migration  occurs  between  this  cave  and  its 
neighboring  river  populations  at  the  time  of  flooding  after  heavy  rain.  The 
third  cave  (Chica)  contains  fish  showing  the  full  range  of  variation  from 
eyeless  and  unpigmented  to  fully  eyed  and  darkly  pigmented,  and  there  is 
evidence  that  migration  occurs  between  this  cave  and  its  neighboring  river 
populations.  The  size  of  these  cave  populations  has  been  estimated  to  be 
200  to  500,  while  the  size  of  river  populations  is  not  known  but  very  large. 
It  is  believed  that  the  caves  in  this  region  of  Mexico  were  formed  before 
the  end  of  the  Pleistocene  (10,000  to  2,000,000  years  ago).  The  estimates 
of $J_X$, $J_{XY}$, and $J_Y$ for the two cave populations and their respective
neighboring river populations (Arroyo B and Arroyo Valles) are given in table 7.4.
It  is  seen  that  the  homozygosities  of  the  two  cave  populations  are  both  very 
high,  as  expected  from  their  small  population  sizes.  On  the  other  hand,  the 
two  river  populations  are  highly  heterozygous  and  share  a large  fraction  of 
common  genes,  the  normalized  identity  of  genes  between  the  two  popula- 
tions (I)  being  0.995.  The  identity  probabilities  between  the  cave  and  river 
populations  indicate  that  a substantial  gene  differentiation  has  occurred 
between  these  populations.  We  assume  that  the  ancestral  populations  of  the 
Pachon  and  Los  Sabinos  fish  are  their  nearby  river  populations  Arroyo  B 
and Arroyo Valles, respectively, and that the average homozygosity ($J_Y$) of
each cave population when it was formed was the same as the present level
of homozygosity in its ancestral population. Then, the I value is 0.7788/
0.8978 = 0.8675 for the Pachon cave and 0.9008 for the Los Sabinos. Thus,
the genetic distance, $D = 2\alpha t$, is 0.1422 for the former and 0.1045 for the
latter.  The  estimate  of  evolutionary  time  then  becomes  roughly  700,000  years 
for  the  Pachon  population  and  500,000  years  for  the  Los  Sabinos  population. 
Interestingly,  these  estimates  agree  well  with  the  geological  estimate  of  the 
time  of  cave  formation. 
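
The cave-fish figures follow from table 7.4 in three steps: identity as the ratio of the between-population value to the homozygosity of the ancestral river population, distance as its negative logarithm, and time from the assumed rate. The Python sketch below retraces them; the function name and the constant `ALPHA` are illustrative, with the rate of $10^{-7}$ per year taken from the text.

```python
import math

ALPHA = 1e-7  # assumed substitution rate per locus per year

def cave_divergence(j_xy, j_x, alpha=ALPHA):
    """Return (I, D, t) for a small population split off from a large one.

    j_xy: identity between the river (X) and cave (Y) populations.
    j_x:  homozygosity of the ancestral river population.
    """
    identity = j_xy / j_x
    distance = -math.log(identity)
    years = distance / (2 * alpha)
    return identity, distance, years

# Values read from table 7.4.
pachon = cave_divergence(0.7788, 0.8978)        # Pachon vs. Arroyo B
los_sabinos = cave_divergence(0.7808, 0.8668)   # Los Sabinos vs. Arroyo Valles

for name, (i, d, t) in (("Pachon", pachon), ("Los Sabinos", los_sabinos)):
    print(f"{name}: I = {i:.4f}, D = {d:.4f}, t = {t:,.0f} years")
```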

As  mentioned  earlier,  there  is  the  possibility  that  a low  rate  of  migration 
occurs  from  rivers  to  the  Los  Sabinos  population.  A slightly  lower  estimate 
of  evolutionary  time  for  this  population  than  for  the  Pachon  may  be  due 
to  this  migration.  A maximum  estimate  of  the  migration  rate  can  be  obtained 
by  using  (7.14).  In  this  case  migration  must  be  unidirectional  from  the  river 
to the cave population. At the steady state, therefore, we have
$I = m_2/(m_2 + 2v) = 0.9008$. If we assume that the generation time for this fish is
6 years, the mutation rate per generation (v) is estimated to be $6 \times 10^{-7}$.
Then, a maximum estimate of migration rate is $1.2 \times 10^{-5}$ per
generation.  This  suggests  that  the  rate  of  migration  is  very  small  if  it  really 
occurs. 

7.4.2  Phylogenetic  trees 

To  my  knowledge,  the  first  phylogenetic  tree  based  on  'genetic  distance' 
was  constructed  by  Cavalli-Sforza  and  Edwards  (1964)  in  man.  They  studied 
the  evolutionary  scheme  of  human  races  by  using  a sizable  number  of  blood 
group  loci.  Their  measure  of  genetic  distance  was  the  angular  transformation 
originally  suggested  by  R.  A.  Fisher.  Although  this  measure  is  not  a simple 
function  of  evolutionary  time,  the  results  obtained  seemed  to  agree  fairly 
well  with  historical  evidence.  This  is  probably  because  the  interracial  gene 
differences in man are so small that most genetic distance measures become
approximately  linear  with  divergence  time. 

After  Cavalli-Sforza  and  Edwards's  work,  many  authors  constructed 
phylogenetic  trees  or  dendrograms  for  various  organisms.  The  data  used  are 
of  various  kinds,  that  is,  the  number  of  amino  acid  differences  in  some 
proteins  (Fitch  and  Margoliash,  1967a),  electrophoretic  identity  of  proteins 
(Nei,  1971a;  Nair  et  al.,  1971;  Lakovaara  et  al.,  1972a),  gene  frequencies 
at  protein  or  blood  group  loci  (Fitch  and  Neel,  1969;  Johnson  and  Selander, 
1971). These different kinds of data were analyzed by using different distance
measures, so that they cannot be directly compared. However, if we use the
distance measure given in section 7.1, all data can be analyzed by the same
method, though some adjustments are required for detectability of gene
differences.

Table 7.5

Estimates of genetic distance between species of Anolis lizards (A. roquet group). From Yang et al. (1974).

       ae(G)        ae(B)        ro           ex           tr           gr           ri           lu           bl

ae(B)  0.004±0.003
ro     0.105±0.063  0.106±0.065
ex     0.137±0.080  0.139±0.081  0.013±0.008
tr     0.235±0.107  0.2?2±0.111  0.191±0.092  0.205±0.096
gr     0.276±0.122  0.29?±0.126  0.313±0.129  0.303±0.126  0.342±0.135
ri     0.324±0.131  0.311±0.129  0.416±0.15?  0.436±0.161  0.369±0.143  0.371±0.145
lu     0.493±0.169  0.512±0.173  0.437±0.156  0.465±0.162  0.395±0.146  0.426±0.155  0.698±0.214
bl     0.700±0.217  0.708±0.220  0.6?3±0.201  0.631±0.201  0.639±0.206  0.710±0.220  0.973±0.281  0.326±0.132
bo     0.459±0.163  0.469±0.167  0.413±0.155  0.447±0.163  0.423±0.152  0.506±0.176  0.761±0.234  0.295±0.120  0.176±0.0?2

Note: The species studied are as follows: aeneus (ae(G) and ae(B)), roquet (ro), extremus (ex), trinitatis (tr), griseus (gr), richardi (ri), luciae
(lu), blanquillanus (bl), and bonairensis (bo).

It  is  also  noted  that  in  some  studies  only  a few  loci  were  used  for  con- 
structing phylogenetic  trees.  For  making  a reliable  tree,  however,  a large 
number  of  loci  should  be  used  particularly  when  the  organisms  involved 
are  closely  related.  As  we  have  seen  in  ch.  5,  gene  frequency  may  change 
at  random  due  to  genetic  drift,  so  that  single  locus  data  are  not  reliable. 
If  we  use  a large  number  of  loci,  such  effects  of  genetic  drift  as  well  as  the 
effects  of  natural  selection  varying  for  different  loci  are  averaged  out.  It  is 
also  important  to  use  loci  which  are  ideally  a random  sample  of  the  genome. 

In  this  section  we  shall  discuss  the  phylogenetic  trees  among  closely  related 
species,  deferring  those  for  organisms  of  higher  ranks  to  the  next  chapter. 
The  distance  measure  to  be  used  is  the  'standard'  genetic  distance  given  in 
section  7.1.  We  shall  discuss  only  the  principles  of  making  trees.  When  a 
tree  is  produced  from  a group  of  incompletely  isolated  populations,  it  may 
not  represent  the  real  evolutionary  history  of  the  populations  at  all.  But,  it 
represents the genetic relationship among them at the time the gene frequency
survey  is  made.  In  this  case  the  tree  produced  is  often  called  a dendrogram. 

In  order  to  make  a phylogenetic  tree  or  dendrogram  it  is  first  required 
to  produce  a matrix  of  genetic  distances  among  all  combinations  of  taxa. 
One  such  example  is  given  in  table  7.5.  If  this  sort  of  distance  matrix  is  given, 
there  are  several  ways  to  produce  a tree  (Sneath  and  Sokal,  1973).  The 
simplest  method  is  to  use  the  unweighted  pair-group  method  of  clustering 
by  Sokal  and  Sneath  (1963).  The  first  two  groups  to  be  clustered  are  those 
with  the  smallest  genetic  distance.  These  two  groups  are  then  combined  and 
taken  to  be  a single  group.  New  estimates  of  genetic  distance  between  this 
combined  group  and  other  groups  are  calculated.  The  same  procedure  is 
followed  until  all  groups  are  clustered  into  one  single  family. 

As  an  example,  suppose  that  there  are  four  groups  and  the  genetic 
distances  are  as  follows: 

Group     1             2             3

2         $D_{12}$
3         $D_{13}$      $D_{23}$
4         $D_{14}$      $D_{24}$      $D_{34}$

Here $D_{ij}$ denotes the genetic distance between groups i and j. Suppose that
the genetic distance between groups 3 and 4 is the smallest. These two
groups are clustered with a branching point located at distance $D_{34}$. They
are then combined into one single group. New estimates of genetic distance
between this combined group and other groups are calculated. That is,

Group     1             2

2         $D_{12}$
(3 + 4)   $D_{1(34)}$   $D_{2(34)}$

Our measure of genetic distance is the number of codon differences and a
linear function of evolutionary time. Therefore, $D_{1(34)}$ and $D_{2(34)}$ are given
by $(D_{13} + D_{14})/2$ and $(D_{23} + D_{24})/2$, respectively. If $D_{2(34)}$ is the smallest,
then group 2 joins the 3-4 cluster with a branching point located at distance
$D_{2(34)}$. In this case, group 1 is the last to be clustered. The branching point
at which this group joins the others is $D_{1(234)} = (D_{12} + D_{13} + D_{14})/3$.
If $D_{1(34)}$ is the smallest, group 1 joins the cluster first and then group 2. On
the other hand, if $D_{12}$ is smaller than either $D_{1(34)}$ or $D_{2(34)}$, groups 1 and
2 are clustered and then the two clusters 1-2 and 3-4 are joined into a single
cluster.
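
The clustering procedure just described can be sketched in Python as follows. This is an illustrative implementation of the unweighted pair-group idea, not code from any of the cited authors: the function `upgma`, its argument layout, and the four-group distances in the example are all hypothetical and chosen only so that group 2 joins the 3-4 cluster before group 1.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa by the unweighted pair-group method.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) to the genetic distance D_ab.
    Returns the merges in order as (cluster_a, cluster_b, branching_distance).
    """
    clusters = [(n,) for n in names]

    def avg(c1, c2):
        # Unweighted average of all pairwise distances between two clusters,
        # e.g. D_1(234) = (D_12 + D_13 + D_14) / 3.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        # Join the two clusters separated by the smallest average distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((c1, c2, avg(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Hypothetical distances for four groups (not taken from any table).
pairs = {("1", "2"): 0.50, ("1", "3"): 0.52, ("1", "4"): 0.56,
         ("2", "3"): 0.30, ("2", "4"): 0.34, ("3", "4"): 0.10}
dist = {frozenset(k): v for k, v in pairs.items()}

merges = upgma(["1", "2", "3", "4"], dist)
for a, b, d in merges:
    print(a, b, round(d, 4))
```

Averaging over the original pairwise distances at every step, rather than over previously averaged values, is what makes the method "unweighted": each taxon contributes equally regardless of when it joined its cluster.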

It should be noted that the above pair-group method of clustering is
based on the assumption that the rate of gene substitution per unit length
of time is constant in all evolutionary branches. Cavalli-Sforza and Edwards
(1967) and Fitch and Margoliash (1967a) developed a method of minimum
evolution, which does not require the above assumption. Using a similar
technique, Farris (1974) produced a phylogenetic tree for the Drosophila
obscura group by using the genetic distance data obtained by Lakovaara
et al. (1972b). However, estimates of genetic distance are generally subject
to a large random error due both to genetic drift in the past evolutionary
process and to sampling variation at the time of the gene frequency survey.
If we use the method of minimum evolutionary distance, even this random
error is regarded as reflecting variation in the rate of gene substitution.
Therefore, the tree produced could be quite erroneous unless the standard
error of genetic distance is reduced to a small magnitude. As long as the
standard error is large, it seems to be better to assume a constant rate of gene
substitution. In fact, in the case of the tree for the D. obscura group,
Lakovaara et al.'s original tree based on this assumption appears to fit the
chromosomal evolution of this group better than Farris' (see Lakovaara
et al., 1974).

Fig. 7.2. Phylogenetic tree for the nine species of Anolis roquet group. This tree was
produced from the genetic distance data in table 7.5. The estimate of absolute evolutionary
time should be regarded as only provisional. Yang et al. (1974) have obtained a different
evolutionary time.

In  table  7.5  the  estimates  of  genetic  distance  between  nine  species  of  lizards 
in the Anolis roquet group (two populations in one species) are given (Yang
et al., 1974). This group of Anolis lizards inhabits a discrete set of islands
(the Lesser Antilles) in the Caribbean Sea. The estimates of genetic distance
are based on gene frequency data for 22 loci, so that they have a rather
large standard error. Nevertheless, it is clear that some species such as
aeneus, extremus, and roquet are genetically close, while species luciae,
blanquillanus, and bonairensis are only remotely related to other species. The
result of cluster analysis is given in fig. 7.2 in the form of a phylogenetic tree.

As  expected,  the  two  populations  of  A.  aeneus  have  the  smallest  genetic 
distance  (0.004).  This  magnitude  of  distance  seems  to  be  reasonable,  since 
these  two  populations  have  been  separated  only  for  about  15,000  years  after 
the  rise  in  eustatic  sea  level.  It  is  known  that  aeneus,  extremus,  and  roquet 
have the chromosome number 2n = 34, while all other species have 2n = 36.
Interestingly,  the  former  three  species  are  closely  related  at  the  gene  level. 
The  genetic  and  phylogenetic  relationships  among  the  nine  species  of  lizards 
become  clearer  if  we  know  the  geological  history  of  the  Lesser  Antilles.  The 
main Lesser Antillean chain has been emergent for no more than 11 million
years,  while  the  Barbados  island  on  which  A.  extremus  lives  was  completely 
submerged  as  recently  as  a half  million  years  ago.  Using  this  information 
and  the  results  of  some  other  studies  on  the  morphology,  ecology,  and 
behavior  patterns  of  these  species,  Yang  et  al.  (1974)  have  made  an  interesting 
inference about the evolutionary scheme of this group of lizards, starting
from  the  invasion  from  South  America. 

In recent years a number of authors have applied the genetic distance method
to produce phylogenetic trees. These trees are generally in agreement with other
evidence, whenever it is available. For example, Nei (1971a) constructed a
phylogenetic  tree  for  nine  species  of  the  Drosophila  virilis  group  by  using 
electrophoretic  data  obtained  by  Hubby  and  Throckmorton  (1968).  The 
results  obtained  were  in  good  agreement  with  the  evolutionary  changes  of 
inversion  chromosomes  as  revealed  by  Stone  et  al.  (1960).  The  phylogenetic 
trees  based  on  genetic  distance  for  the  mesophragmatica  (Nair  et  al.,  1971), 
obscura  (Lakovaara  et  al.,  1972a),  and  affinis  (Lakovaara  et  al.,  1972b) 
groups  of  Drosophila  and  for  11  species  of  kangaroo  rats  (Johnson  and 
Selander,  1971)  are  all  compatible  with  their  chromosomal  evolution.  Levy 
and  Levin  (1974)  have  shown  that  the  evolutionary  scheme  of  the  Oenothera 
biennis  complex  revealed  by  enzyme  studies  agrees  fairly  well  with  Cleland’s 
(1972)  results  from  chromosomal  studies. 

The genetic distance between species is roughly correlated with the
morphological difference. However, the details of phylogenetic trees produced
from  genetic  distances  often  disagree  with  those  based  on  morphological 
characters  (Lakovaara  et  al.,  1972a;  Johnson  and  Selander,  1971).  This  is  not, 
of  course,  unreasonable,  because  morphological  characters  may  be  changed 
considerably  by  a relatively  small  number  of  gene  substitutions. 


7.5  Mechanism  of  speciation 

The  plausible  process  of  species  formation  has  been  discussed  extensively 
by  Dobzhansky  (1951,  1970)  and  Mayr  (1963).  In  the  present  book  it  will 
suffice  to  discuss  only  the  essential  aspects  of  speciation. 

7.5.1  Classification  of  isolation  mechanisms 

As  mentioned  earlier,  for  a pair  of  populations  to  be  genetically  differentiated, 
they  must  be  completely  isolated  from  each  other.  This  isolation  may  occur 
geographically  or  reproductively.  There  are  many  different  mechanisms  for 
reproductive isolation. Dobzhansky's (1970) classification is as follows:

1)  Premating  or  prezygotic  mechanisms  prevent  the  formation  of  hybrid 
zygotes. 

a) Ecological or habitat isolation. The populations concerned occur in
different  habitats  in  the  same  general  region. 

b)  Seasonal  or  temporal  isolation.  Mating  or  flowering  times  occur  at 
different  seasons. 

c)  Sexual  or  ethological  isolation.  Mutual  attraction  between  the  sexes 
of  different  species  is  weak  or  absent. 

d)  Mechanical  isolation.  Physical  noncorrespondence  of  the  genitalia  or 
the  flower  parts  prevents  copulation  or  the  transfer  of  pollen. 

e)  Isolation  by  different  pollinators.  In  flowering  plants,  related  species 
may  be  specialized  to  attract  different  insects  as  pollinators. 

f)  Gametic  isolation.  In  organisms  with  external  fertilization,  female  and 
male  gametes  may  not  be  attracted  to  each  other.  In  organisms  with  internal 
fertilization,  the  gametes  or  gametophytes  of  one  species  may  be  inviable 
in  the  sexual  ducts  or  in  the  styles  of  other  species. 

2) Postmating or zygotic isolating mechanisms reduce the viability or
fertility of hybrid zygotes.

g)  Hybrid  inviability.  Hybrid  zygotes  have  reduced  viability  or  are 
inviable. 

h)  Hybrid  sterility.  The  F1  hybrids  of  one  sex  or  of  both  sexes  fail  to 
produce  functional  gametes. 

i) Hybrid breakdown. The F2 or backcross hybrids have reduced viability
or  fertility. 

It  should  be  emphasized  that  any  reproductive  isolation  is  caused  by 
some  sort  of  genetic  differences  between  populations,  while  geographic 
isolation  may  occur  without  any  genetic  differences.  At  the  very  early  stage 
of  population  splitting,  there  should  not  be  any  substantial  gene  differences 
between  the  populations  formed.  At  this  stage,  therefore,  isolation  must  be 
geographical.  If  two  populations  are  geographically  isolated  for  a certain 
period  of  evolutionary  time,  they  would  accumulate  different  mutations  and 
reproductive isolation is expected to be gradually developed. Once a
mechanism of reproductive isolation is established, gene exchange no longer
occurs  between  the  two  populations  even  if  they  come  to  occupy  the  same 
geographic  area.  This  scheme  of  speciation  is  called  allopatric  speciation. 
Some  authors  (e.g.  Maynard  Smith,  1966),  however,  believe  that  under 
certain  conditions  speciation  may  occur  sympatrically,  i.e.,  in  the  same  area 
without geographic isolation. Also, in plants and some animals
autotetraploids or allotetraploids may be produced by chromosome doubling.
In this case the new polyploids may evolve into a new species sympatrically
because of the immediate establishment of reproductive isolation by means
of different chromosome numbers.

7.5.2  Evolution  of  reproductive  isolation 

In  any  organism  establishment  of  reproductive  isolation  is  the  crux  of 
speciation.  How  this  mechanism  has  evolved,  however,  is  not  well  understood 
except  in  some  special  cases.  Nevertheless,  it  seems  to  be  worthwhile  to 
speculate  on  some  possible  schemes  of  evolution  of  reproductive  isolation. 
It  would,  I hope,  stimulate  experimental  research  in  this  area. 

The  evolutionary  scheme  of  reproductive  isolation  would  vary  with 
different  isolating  mechanisms.  Ecological  and  seasonal  isolation  mechanisms 
may  be  developed  by  a single  gene  substitution,  though  generally  more  than 
one  gene  difference  would  be  involved.  Similarly,  isolation  by  different 
pollinators  may  evolve  by  a single  gene  substitution  in  the  host  plant.  It 
seems,  however,  that  for  the  evolution  of  ethological,  mechanical,  and 
gametic  isolations  more  than  two  gene  substitutions  are  required  except  in 
some  special  cases.  Similarly,  more  than  two  gene  substitutions  seem  to  be 
involved  in  the  evolution  of  postzygotic  isolating  mechanisms. 

One  possible  scheme  of  evolution  of  ethological  isolation  with  two  loci 
would  be  as  follows:  In  some  organisms  such  as  Drosophila  females  choose 
their  mates,  while  males  generally  do  not  have  any  mate  preference.  Suppose 
that  loci  A and  B control  male-limited  and  female-limited  morphological, 
physiological,  or  behavior  characters,  respectively,  and  that  the  original 
genotype is $A_0A_0B_0B_0$ for both males and females. Mutant gene $A_1$ changes
the male character, while mutant $B_1$ changes the female character. Because
of these changed characters, $B_0B_1$ or $B_1B_1$ females may prefer $A_0A_1$ or
$A_1A_1$ males rather than $A_0A_0$ males. Namely, assortative mating may occur.
Then, $A_1$ and $B_1$ may be jointly fixed, by chance, in a finite population even
if there is no fitness difference among different genotypes. Of course, if the
mating $A_1 \times B_1$ has a higher fertility, then the fixation of the $A_1$ and
$B_1$ genes would be accelerated. If another descendant population still has
genes $A_0$ and $B_0$ or new mutant genes different from $A_1$ and $B_1$, then the
two populations
will  manifest  ethological  isolation.  Essentially  the  same  evolutionary  scheme 
may  produce  mechanical  and  gametic  isolating  mechanisms.  The  important 
feature  of  this  scheme  of  evolution  is  that  the  fixation  of  mutant  genes  may 
occur  without  selection.  There  is  no  need  for  selection  favoring  ethological 
isolation  envisaged  by  Dobzhansky  and  Pavlovsky  (1971),  though  it  may 
happen  in  practice  (see  Muller,  1940). 

In the evolution of postzygotic isolating mechanisms several epistatic gene
loci for fitness seem to be involved, though it is not impossible for a single
locus to establish reproductive isolation. Dobzhansky (1951) has suggested
the following scheme. Consider two loci (or two sets of loci) which control
some type of postzygotic reproductive isolation, and let $A_0A_0B_0B_0$ be the
genotype for these loci of the foundation stock from which populations 1 and
2 are derived. If these two populations are geographically isolated, it is
possible that in population 1 $A_0$ mutates to $A_1$ and this mutant gene may be
fixed in the population by chance, provided that $A_0A_1B_0B_0$ and $A_1A_1B_0B_0$
are as fertile (or viable) as $A_0A_0B_0B_0$. Similarly, in population 2 mutation
may occur at the B locus and genotype $A_0A_0B_0B_0$ may be replaced by
$A_0A_0B_2B_2$ without loss of fertility. However, if there is gene interaction
such that any combination of mutant genes $A_1$ and $B_2$ results in sterility
or inviability, the hybrids ($A_0A_1B_0B_2$) between the two populations will be
infertile or inviable.

A possible explanation of this scheme at the molecular level is as follows:
Let $\alpha^0$, $\alpha^1$, $\beta^0$, and $\beta^2$ be the polypeptides produced by
genes $A_0$, $A_1$, $B_0$, and $B_2$, respectively, and suppose that each locus
produces a protein composed of two polypeptides. Thus, in the hybrids the A
locus would produce proteins $\alpha^0\alpha^0$, $\alpha^0\alpha^1$, and
$\alpha^1\alpha^1$ in the ratio 1:2:1, while the B locus would produce
$\beta^0\beta^0$, $\beta^0\beta^2$, and $\beta^2\beta^2$ in the same ratio. If the
functions of $\alpha^0\alpha^1$ and $\alpha^1\alpha^1$ are incompatible with those of
$\beta^0\beta^2$ and $\beta^2\beta^2$ or vice versa, then hybrid inviability or
sterility may result. In this case there is no adverse interaction between
$\alpha^0\alpha^0$ and $\beta^0\beta^2$ or $\beta^2\beta^2$ or between $\beta^0\beta^0$
and $\alpha^0\alpha^1$ or $\alpha^1\alpha^1$. Therefore, the hybrid inviability or
sterility may not be complete. However, if one more mutation is fixed in each
population, so that the genotypes of populations 1 and 2 become
$A_1A_1B_1B_1$ and $A_2A_2B_2B_2$, respectively, then the postzygotic
isolating mechanism would be completed.
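
The 1:2:1 ratios, and the reason why the incompatibility is only partial, can
be checked with a small enumeration. The sketch below assumes that the two
polypeptides of each locus are produced in equal amounts and associate at
random (the assumption behind the 1:2:1 ratio); the labels `a0`, `a1`, `b0`,
`b2` and the function name `dimer_ratios` are hypothetical and introduced only
for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def dimer_ratios(chains):
    """Proportions of dimers formed by random pairing of equally common chains."""
    counts = Counter(tuple(sorted(pair)) for pair in product(chains, repeat=2))
    total = sum(counts.values())
    return {dimer: Fraction(n, total) for dimer, n in counts.items()}


a_dimers = dimer_ratios(["a0", "a1"])  # proteins of the A locus in the hybrid
b_dimers = dimer_ratios(["b0", "b2"])  # proteins of the B locus in the hybrid

# Share of (A protein, B protein) combinations in which both proteins
# carry a mutant polypeptide and are therefore assumed to be incompatible.
clash = sum(
    pa * pb
    for da, pa in a_dimers.items() if "a1" in da
    for db, pb in b_dimers.items() if "b2" in db
)
print(a_dimers, b_dimers, clash)
```

Under these assumptions 9/16 of the protein combinations in the hybrid involve
a mutant polypeptide from both loci, while the remaining 7/16 include at least
one unaltered protein, which is why the isolation need not be complete.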

In the above scheme we assumed that the genotypes $A_0A_0B_2B_2$ and
$A_1A_1B_0B_0$ are as fertile as $A_0A_0B_0B_0$. We note, however, that in small
populations even slightly deleterious mutations as well as neutral or
advantageous mutations may be fixed in the population (ch. 5). Thus, the
mutant genes $A_1$ and $B_2$ themselves may be slightly deleterious. In this case
the mean fitness of the population would be reduced to a slight degree after
fixation of these genes. However, it would not seriously threaten the survival
of the population if the next mutant genes to be fixed are advantageous and
restore the population fitness. If this process of fixation of negative and
positive mutations is repeated, then we would expect that a system of
coadapted genes is developed within each of the isolated populations and the
hybrids between them will show poor viability and fertility. Since in small
populations various kinds of mutations from slightly deleterious to
advantageous ones may be fixed, the development of reproductive isolation will
be faster when population size is small than when it is large.
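
The dependence on population size can be illustrated numerically. The sketch
below is not part of the argument above; it assumes Kimura's diffusion
approximation for the fixation probability of a single new semidominant
mutant, $u = (1 - e^{-2s})/(1 - e^{-4Ns})$, with the effective size taken equal
to the actual size $N$. The selection coefficient and the population sizes are
hypothetical values chosen for illustration.

```python
import math


def fixation_probability(s, n):
    """Fixation probability of one new semidominant mutant (diffusion approx.).

    s -- selection coefficient of the heterozygote (negative if deleterious)
    n -- diploid population size, assumed equal to the effective size
    """
    if s == 0:
        return 1.0 / (2 * n)  # neutral limit: the initial gene frequency
    # (1 - exp(-2s)) / (1 - exp(-4ns)); expm1 keeps precision for small s.
    return math.expm1(-2 * s) / math.expm1(-4 * n * s)


s = -0.001  # hypothetical slightly deleterious mutation
ratios = {
    n: fixation_probability(s, n) / fixation_probability(0, n)
    for n in (100, 1000, 10000)
}
for n, ratio in ratios.items():
    print(n, ratio)
```

The printed ratio compares the deleterious mutant with a neutral one in a
population of the same size. It stays close to one when $N$ is small and falls
steeply as $N$ increases, in line with the statement that slightly deleterious
mutations are fixed mainly in small populations.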

Although  there  is  no  direct  evidence  for  the  above  scheme  of  evolution, 
gene  interaction  between  two  or  more  loci  seems  to  be  a necessary  condition 
for reproductive isolation. In fact, most genetic studies on intersubspecific
and interspecific inviability or sterility support this view. For example,
Oka (1974) identified more than two complementary genes controlling
the hybrid sterility between two subspecies of rice, Oryza sativa japonica
and O. s. indica. Also, Prakash (1972) showed that the sterility of F1 males
obtained from the cross between females from Bogota (Colombia) and males
from the United States mainland in Drosophila pseudoobscura can be
explained by the interaction between two loci on the X chromosome and one
locus on each of two autosomes. In this case females from the same cross
and F1 males and females from the reciprocal cross are fully fertile. So, even
in simple reproductive isolation, a number of loci seem to be involved. The
number of loci concerned with interspecific reproductive isolation appears to
be quite large. This is true at least in the case of hybrid sterility
between  Drosophila  pseudoobscura  and  D.  persimilis,  where  testis  size  of 
hybrid  males  is  controlled  by  at  least  eight  loci  distributed  on  the  X,  second, 
third,  and  fourth  chromosomes  (Dobzhansky,  1936). 

In  some  cases  the  interaction  between  cytoplasm  and  nuclear  genes  plays 
an important role in developing reproductive isolation, as shown by
Michaelis (1954) in the species of Epilobium and by Kihara (1959) in the cross
Triticum vulgare x Aegilops caudata. In some other cases the
interaction between the Y chromosome and autosomes seems to be important
(Patterson  and  Stone,  1952).  The  evolutionary  scheme  of  these  reproductive 
isolations,  however,  seems  to  be  essentially  the  same  as  that  discussed  above. 

Examining  data  on  interspecific  hybridization,  Haldane  (1922)  noticed 
that  in  organisms  with  differentiated  sex  chromosomes  hybrid  inviability 
or  sterility  is  generally  expressed  more  frequently  in  the  heterogametic  sex 
than in the homogametic sex. Thus, in Drosophila F1 males are more often
inviable or sterile than F1 females, while in silkworms the situation is reversed.
This property is often called Haldane's rule. This rule was first explained by
complementary gene action of X-linked genes with autosomal genes
(Haldane, 1922; Muller, 1940). In interspecific hybridization the homogametic
F1 receives one X chromosome and one set of autosomes from each of the
parental species, while in the heterogametic sex the X chromosome from one
parental species is missing although both sets of autosomes are fully
represented. Thus, the autosomal genes which are complementary to the genes
on the missing X chromosome will not function normally in the
heterogametic sex. This would result in heterogametic inviability or sterility.

Later,  however,  Haldane  (1932)  abandoned  this  genic  imbalance  theory, 
and  preferred  an  explanation,  which  was  termed  the  chromosome  imbalance 
theory by Tracey (1972). This theory is based on Stern's (1929) experiments
with X-Y translocations in Drosophila melanogaster. Stern produced an
X-Y translocation stock in which one arm of the Y chromosome was carried
by the X chromosome and the Y lacked the arm carried by the X-Y
chromosome. Since all the Y chromosome genes were present, this stock was fully
fertile. However, crosses between males from this stock and females from a
normal stock produced sterile F1 males. The sterility of the F1 males was
due to the absence of genes required for sperm motility which were carried
by the Y chromosome arm translocated to the X. Interestingly, Muller (1940)
rejected this second explanation and preferred Haldane's first hypothesis. In
practice,  however,  the  two  types  of  mechanisms  are  not  mutually  exclusive 
and  both  seem  to  be  responsible  for  heterogametic  inviability  or  sterility 
(see  Tracey,  1972). 

7.5.3 How fast is reproductive isolation established?

An  important  question  about  speciation  is:  How  fast  does  a new  species 
emerge?  This,  of  course,  depends  on  how  fast  new  mutations  controlling 
reproductive  isolation  occur  and  are  fixed  in  the  population.  In  general, 
it  seems  to  take  a long  time,  though  it  would  vary  considerably  in  individual 
cases. We have seen that some pairs of subspecies, which are not yet
reproductively isolated, have a much larger genetic distance than some pairs of
species which are already reproductively isolated. The estimates of
intersubspecific genetic distance indicate that reproductive isolation may not
be developed even if the genetic distance is as high as 0.3 (possibly
corresponding to an evolutionary time of about 1.5 million years). In the case of
Drosophila pseudoobscura and D. persimilis, however, reproductive isolation
has been established even though the genetic distance is only 0.05 (possibly
about 250,000 years). This large variation in genetic divergence that occurs (or
evolutionary time that elapses) before the establishment of reproductive
isolation is, of course, understandable, since reproductive isolation may be
completed by a small number of gene substitutions. Zouros (1973) has shown
that the correlation between genetic divergence and index of fertile hybrid
production in closely related species of Drosophila is rather small (see also
Richmond,  1972b).  In  frogs  Wilson  et  al.  (1974)  have  shown  that  two  species 
which  are  capable  of  producing  hybrids  often  have  a large  genetic  divergence 
comparable  to  that  between  different  orders  of  mammals. 
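
The two pairs of figures quoted above (0.3 with about 1.5 million years, and
0.05 with about 250,000 years) imply the same rough proportionality between
genetic distance and time, about five million years per unit of distance. The
following sketch simply applies that proportionality. The constant is inferred
from the quoted figures, which the text itself qualifies with "possibly", and
the names `YEARS_PER_UNIT_DISTANCE` and `divergence_time` are introduced here
only for illustration.

```python
# Years of divergence per unit of genetic distance, inferred from the two
# pairs of figures quoted in the text. This constant is an assumption of
# the sketch and carries the same uncertainty as those figures.
YEARS_PER_UNIT_DISTANCE = 5e6


def divergence_time(distance):
    """Rough divergence time in years for a given genetic distance."""
    return YEARS_PER_UNIT_DISTANCE * distance


for d in (0.3, 0.05, 0.004):
    print(d, round(divergence_time(d)))
```

The third value, 0.004, is the distance between the two populations of
A. aeneus mentioned earlier. The same conversion gives about 20,000 years for
it, which is of the same order as the separation time of about 15,000 years
quoted there.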

The  degree  of  reproductive  isolation  between  two  taxa  is  also  not  correlated 
with  morphological  divergence.  Thus,  some  pairs  of  subspecies  or  species 
show  a considerable  amount  of  morphological  differences,  yet  they  can 
produce  completely  fertile  hybrids  when  crossed  artificially.  On  the  other 
hand,  many  sibling  species  in  Drosophila  are  morphologically  indistinguishable 
or  distinguishable  with  difficulty  but  do  not  produce  fertile  hybrids.  Clearly, 
the  genes  controlling  reproductive  isolation  manifest  few  morphological 
effects. 

The  usual,  and  by  now  orthodox,  view  of  speciation  is  that  it  occurs  by 
slow genetic divergence, and subsequent reproductive isolation, of
geographically separated and differentially adapted races or subspecies
(Dobzhansky, 1972). This implies that there must be some adaptive differences
between  races  or  subspecies  before  reproductive  isolation  occurs.  Recently, 
Carson  (1970,  1971,  1973)  proposed  a hypothesis  that  speciation  may  occur 
without  any  prior  adaptive  divergence  within  a relatively  small  number  of 
generations.  This  hypothesis  is  based  on  his  studies  on  Hawaiian  Drosophila 
species,  many  of  which  apparently  evolved  very  rapidly  by  colonizing 
various niches on newly formed islands. Studying cytogenetic,
morphological, and biogeographical properties of these species, he came to the
conclusion  that  each  species  on  an  island  is  probably  descended  from  a single 
gravid  female  that  migrated  from  the  donor  island.  Carson  (1973)  argues 
that  if  a species  starts  from  a single  inseminated  female,  a strong  founder 
effect  may  occur  and  this  would  result  in  a catastrophic  reorganization  of  the 
gene  pool  in  the  presence  of  epistatic  gene  interaction.  He  states  that  the 
founder  effect  alone  is  not  sufficient  for  such  a reorganization  to  occur;  the 
original  founder  female  must  be  derived  from  a population  which  has 
recently  undergone  a rapid  explosion  or  flush.  The  reason  for  this  is  that 
such  a population  flush  with  relaxation  of  selection  may  produce  a rare  gene 
combination at epistatic loci. Apparently, he is thinking of joint fixation of
coadapted  genes  in  the  population. 

This  theory,  however,  has  some  difficulties.  First,  the  assumption  that 
selection  is  relaxed  during  population  flush  but  resumed  after  colonization 
is  completed  is  unlikely.  Second,  even  if  this  assumption  is  satisfied,  the 
probability  of  joint  fixation  of  coadaptive  genes  is  extremely  small  (Crow 
and  Kimura,  1965;  Ohta,  1968).  Nevertheless,  small  populations  seem  to  be 
favorable for a rapid evolution of reproductive isolation. Coadaptive genes
need not be fixed jointly but can be fixed successively, as discussed earlier.
In our evolutionary scheme, no population flush is required either. The
evolution of reproductive isolation is a post-isolation event. In this connection it
is  interesting  to  note  that  rapid  evolution  in  the  past  seems  to  have  occurred 
almost  always  when  population  size  was  small  (Simpson,  1953). 

An  apparently  rapid  establishment  of  male  hybrid  sterility  in  laboratory 
populations  was  recently  reported  by  Dobzhansky  and  Pavlovsky  (1971) 
in  a strain  of  Drosophila  paulistorum.  This  strain  was  descended  from  a single 
inseminated  female  captured  in  the  Llanos  of  Colombia  in  March,  1958. 
When tested in 1958, this strain produced fertile hybrids with the Orinocan
subspecies and was classified as a strain of this subspecies. In the test conducted
in 1963, however, it produced sterile male hybrids when crossed with
Orinocan. Dobzhansky (1972) gives three possible explanations, including the
effect  of  the  cytoplasmic  symbionts  which  may  cause  male  sterility,  but  none 
of  them  has  yet  been  substantiated.  Obviously,  a detailed  study  of  the  genetic 
mechanism  of  this  male  sterility  should  be  conducted. 

Another  possible  example  of  rapid  development  of  male  hybrid  sterility 
was  reported  by  Prakash  (1972)  in  D.  pseudoobscura.  As  mentioned  earlier, 
the male hybrid sterility in the cross between the Bogota and North American
strains in this species is controlled by at least four loci. Prakash states that
the Bogota population was introduced apparently very recently from a
Central or North American population, since before 1960 no one had
observed this species in the Bogota area. If this is true, the male hybrid sterility
must  have  developed  in  about  100  generations  by  the  substitution  of  at  least 
four  sterility  genes.  If  this  really  occurred,  it  is  unusually  rapid  evolution. 
The  levels  of  average  heterozygosity  and  average  number  of  alleles  per  locus 
in  the  Bogota  population  seem  to  support  this  hypothesis  (Nei  et  al.,  1975). 
At  this  moment,  however,  there  is  no  way  to  prove  that  D.  pseudoobscura 
was  really  introduced  into  the  Bogota  area  around  1960  (Dobzhansky,  1973). 


Long-term evolution


In  the  preceding  chapters  we  were  mainly  concerned  with  the  change  in  gene 
frequency  in  populations  and  the  processes  leading  to  speciation.  In  the 
present  chapter  we  shall  discuss  long-term  evolution  by  comparing  DNA, 
RNA,  and  proteins  from  remotely  related  organisms.  In  the  last  decade 
rapid  progress  has  been  made  in  this  area,  and  a large  body  of  experimental 
data  and  their  implications  for  organic  evolution  have  been  discussed  in 
Dayhoff's (1972) book 'Atlas of Protein Sequence and Structure'. In the
present  book,  therefore,  we  shall  discuss  only  the  main  results  and  their 
bearings  on  the  mechanism  of  evolution. 


8.1  Evolutionary  change  of  DNA 


8.1.1 DNA content

During  the  evolutionary  process  DNA  content  has  increased  considerably, 
as will be seen from fig. 8.1. Although the present viruses would not
represent the oldest form of organism, some viruses such as φX174 and F1
have a DNA content comprised of only six to eight genes (about 6000
nucleotides long). On the other hand, mammalian species have about
$3 \times 10^9$ nucleotide pairs per haploid genome, which is equivalent to about
three million genes (taking about 1000 nucleotide pairs per gene) if all the DNA
is informational. This increase in DNA
content was clearly important for organisms to evolve from simpler to
complex forms. For a highly ordered, complex organism to maintain its
life, a large number of genes are required. In fact, there are many genes
which exist only in higher organisms. For example, the genes for hemoglobin,
haptoglobin, and immunoglobulins exist only in higher organisms.

Fig. 8.1. The minimal amount of DNA that has been observed for various species in the
types of organisms listed. Each point represents the measured DNA content per cell for a
haploid set of chromosomes. The ordinate scale and the shape of the curve are arbitrary.
From Britten and Davidson (1969), reprinted by permission, The American Association
for the Advancement of Science, © 1969.


Table 8.1

DNA contents of various organisms.

Organism          Nucleotide pairs       Organism       Nucleotide pairs
                  per genome                            per genome

Mammals           $3.2 \times 10^9$      Fruit fly      $0.1 \times 10^9$
Birds             $1.2 \times 10^9$      Maize          $7 \times 10^9$
Lizards           $1.9 \times 10^9$      Neurospora     $4 \times 10^7$
Frogs             $6.2 \times 10^9$      E. coli        $4 \times 10^6$
Most bony fish    $0.9 \times 10^9$      T4 phage       $2 \times 10^5$
Lungfish          $111.7 \times 10^9$    λ phage        $1 \times 10^5$
Echinoderm        $0.8 \times 10^9$      φX174          $6 \times 10^3$

However, a close examination of the genome sizes of various organisms
shows that DNA content is not necessarily correlated with the complexity of
the organism (table 8.1). This has been confirmed by Bachmann et al. (1972)
and Sparrow et al. (1972) in surveys of the DNA contents of a large number
of animals and plants. For example, a species of lungfish has a DNA content
about 40 times higher than that of mammals. Many amphibians also have
a larger amount of DNA than mammalian species. Thus, a large DNA
content by itself is not sufficient to produce a complex organism. For a
complex organism to be produced, there must be a sufficiently large number
of different genes in the genome. At the present time we do not know the
number of different kinds of genes in a genome except in some
microorganisms.

8.1.2 Evolutionary mechanisms of increase in DNA content

The large DNA contents of higher organisms are believed to have arisen
mainly by gene duplication in the evolutionary process. There are two types
of gene duplication. One is chromosome duplication, and the other is the
duplication of a small segment of chromosome (tandem duplication) by
unequal crossing over. A common type of chromosome duplication is genome
duplication. As seen from table 8.1, the mammalian DNA is about 1000 times
greater than the Escherichia coli DNA. If the increase in DNA content is
entirely due to genome duplication, there must have been about ten
($2^{10} \approx 1000$) genome duplications from bacteria to mammals. If
bacteria evolved about $3 \times 10^{9}$ years ago (ch. 2), the genome
duplication must have occurred on the average once in $3 \times 10^{8}$
years (Nei, 1969a). On the other hand, if DNA content increases
continuously by unequal crossing over, the rate of increase may be
expressed as

$$\frac{dn}{dt} = kn, \tag{8.1}$$

where $n$ is the total number of nucleotide pairs in DNA, $t$ is the time in
years, and $k$ is a constant. Solution of this equation gives
$n = n_0 \exp(kt)$, where $n_0$ is the initial DNA content. From bacteria to
mammals the DNA increased 1000 times in about $3 \times 10^{9}$ years.
Therefore, $k$ is estimated to be $2.3 \times 10^{-9}$. This means that a
DNA content comparable to that of mammals would increase by an average of
seven nucleotide pairs per year.
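
The arithmetic behind both estimates can be reproduced in a few lines. The following is a minimal sketch rather than part of the original derivation: the 1000-fold increase and the $3 \times 10^{9}$ years come from the text, while the round figure of $3 \times 10^{9}$ nucleotide pairs for a mammalian genome is an assumption made here for illustration, and all names are hypothetical.

```python
import math

FOLD_INCREASE = 1000.0   # mammalian DNA relative to E. coli DNA (from the text)
ELAPSED_YEARS = 3e9      # time since bacteria evolved (from the text)
MAMMAL_PAIRS = 3e9       # assumed round figure for a mammalian genome

# Discrete model: every genome duplication doubles the DNA content.
doublings = math.log2(FOLD_INCREASE)
years_per_doubling = ELAPSED_YEARS / doublings

# Continuous model (8.1): dn/dt = k * n, so n = n0 * exp(k * t).
k = math.log(FOLD_INCREASE) / ELAPSED_YEARS
pairs_per_year = k * MAMMAL_PAIRS   # dn/dt evaluated at a mammal-sized genome

print(f"genome duplications needed: {doublings:.2f}")
print(f"years per duplication:      {years_per_doubling:.2e}")
print(f"k per year:                 {k:.2e}")
print(f"nucleotide pairs per year:  {pairs_per_year:.1f}")
```

Under the continuous model the yearly gain is proportional to the current genome size, so the figure of about seven pairs per year applies only to a genome that has already reached mammalian size.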

In plant evolution genome duplication or polyploidization played an
important role, as documented by Stebbins (1950). In animals, it was
customary in the past to assume that the major mechanism responsible for
the increase in genetic material was unequal crossing over (Bridges, 1936).
However, recent studies of nuclear DNA content indicate that its variation
among different organisms is rather discrete. Therefore, genome duplication
seems to have been quite important in the evolution of animals. From the
results of cytological and biochemical studies, Ohno (1967, 1970) concludes
that at least one polyploidization occurred in the mammalian lineage about
300 million years ago at the fish stage. He believes that genome duplication
was quite common in animal evolution before sex chromosomes were
differentiated. Once the differentiation of sex chromosomes was completed
in the mammalian, avian, and reptilian lineages, genome duplication seems
to have disrupted the mechanism of sex determination, so that any resulting
tetraploid was almost immediately eliminated (Muller, 1925). In most fish
and amphibians the sex chromosomes have not yet been established and the
tetraploid males and females can be maintained without much difficulty
(Ohno, 1967). In fact, Becak et al. (1966) discovered a bisexual tetraploid
species of frog in South America.

Tandem duplication by unequal crossing over was apparently equally
important in organic evolution. Genes controlling the same or similar
functions are often closely linked. For example, about 100 duplicate genes
for ribosomal RNA are clustered in the nucleolar organizer region of each
of the X and Y chromosomes in Drosophila melanogaster (Ritossa and
Spiegelman, 1965). Similarly, homologous genes coding for several
immunoglobulin polypeptides are also closely linked. A further example is
the close linkage between the genes for the β- and δ-chains of human
hemoglobin (Boyer et al., 1963). The evolution of these closely linked
homologous genes can best be explained by tandem duplication. Horowitz
(1965) and Lewis (1967) postulate that operons in bacteria have also evolved
by a process of repeated tandem duplications accompanied by gradual
functional differentiation of the daughter genes, though in this case the
homology of structural genes of an operon has yet to be confirmed.

8.1.3  Formation  of  new  genes 

1) Complete gene duplication

If two duplicate genes are produced from a gene, one of them may mutate
drastically and become an entirely different gene in function. The simplest
way to determine whether a pair of genes have descended from a common
ancestor is to examine the nucleotide sequences of the genes or the amino
acid sequences of the proteins coded for by the genes. In fact, by examining
the amino acid sequences of myoglobin and the α-, β-, and γ-chains of
hemoglobins in man, Ingram (1961, 1963) was able to show that the genes
responsible for the three chains of human hemoglobin were produced by gene
duplication. Comparison of the three chains indicates that the proportion
of common amino acids between the α- and β-chains is as high as 41 percent,
while that between the β- and γ-chains is even higher (73 percent)
(table 8.2). These similarities are so high that the probability that they
are due to chance is negligible. Ingram further showed that human myoglobin
also originated from the same common ancestor as the three chains of
hemoglobin.

Table 8.2

Extents of divergence and functional differences between proteins derived from gene
duplications. Chemical activities include differences in catalytic action and in binding
to substrates, inhibitors, antigens, etc. From Dayhoff and Barker (1972).

Proteins                                       Amino acid   Divergence       Chemical activities,
                                               diff. (%)    time (10^6 yr)   aggregation properties,
                                                                             action sites
                                                                             (in that order)

Hemoglobin-myoglobin                           77           1100             -   ++  +
Growth hormone-prolactin                       77           200              +   -   +
Immunoglobulin heavy and light chains          75           400              ++  +   -
Immunoglobulin μ- and γ-chain C regions        70           [illegible]      +   +   +
Thyrotropin and luteinizing hormone β-chains   69                            +   -   +
Trypsin-thrombin                               63           1500             +   -
Lactalbumin-lysozyme                           61           350              -   +
Immunoglobulin κ- and λ-chain C regions        62           300              -   -   -
Basic and colostrum trypsin inhibitors         60                            -   -   +
Hemoglobin α- and β-chains                     59           600              -   +   -
Glucagon-secretin                              52                            +   -   +
Hemoglobin β- and γ-chains, human              27           130              -   -   -
Protamines, salmine AI and AII                 22           100              -   -
Chymotrypsin A and B                           21           270              -   -   -
Growth hormone-lactogen                        15           23               +   -   +
Hemoglobin β- and δ-chains, human              [illegible]  40               -   -
Alcohol dehydrogenase E- and S-chains          1.7                           +   -   -

++ Very different, + Different, - Similar.
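
The proportion of common amino acids quoted for the hemoglobin chains is simply the fraction of aligned positions at which two sequences carry the same residue. A minimal sketch of that count is given below, assuming the two sequences have already been aligned to equal length; the function name and the two ten-residue strings are hypothetical and are not real hemoglobin sequences.

```python
def proportion_identical(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which two sequences share a residue."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


# Hypothetical aligned fragments in one-letter amino acid code.
fragment_1 = "ACDEFGHIKL"
fragment_2 = "ACDQFGHLKL"

identity = proportion_identical(fragment_1, fragment_2)
print(f"common amino acids: {identity:.0%}")
print(f"amino acid difference: {1 - identity:.0%}")
```

The amino acid difference listed in table 8.2 is the complement of this proportion, so 41 percent common residues between the α- and β-chains corresponds to the tabulated difference of 59 percent.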

After Ingram's study, many examples of formation of new genes by gene
duplication were discovered. Table 8.2 gives some typical examples. The
approximate time of divergence for each pair of homologous proteins was
computed from the similarity of amino acid sequence by a method similar to
that discussed in ch. 2. It is seen that protein function is considerably
differentiated between some pairs of homologous proteins such as hemoglobin
and myoglobin, while some pairs of proteins such as the human hemoglobin
β- and δ-chains still maintain essentially the same function. The human
β- and δ-chains are apparently interchangeable, since the proportion of
δ-chain hemoglobin in adults varies considerably among individuals without
any noticeable effect. It is also noted that the pairs of homologous
proteins between which the amino acid sequences differ by more than
50 percent generally have different functions. On the other hand, there is
little functional differentiation between a pair of proteins where the
sequence differences are less than 15 percent.

Under certain conditions, however, a gene of new function may be formed
through a relatively small number of mutational steps. This occurs
particularly when the substrates of the original and mutant enzymes are
closely related. The normal strain of Pseudomonas aeruginosa uses acetamide
and propionamide as a source of nitrogen but not valeramide and
phenylacetamide. By exposing this strain to mutagenic agents and conducting
artificial selection, however, Betz et al. (1974) produced a number of
mutant strains which can utilize valeramide or phenylacetamide. Studies on
the biochemical properties of the new enzymes produced have suggested that
only a few steps of mutational changes were involved in the formation of
the new genes.

Gene duplication seems to be occurring even at the present time. Schroeder
et al. (1968) have shown that the human genome has at least two nonallelic
genes for the γ-chain, which produce different amino acids at the 136th
amino acid position. Also, there seem to be two α-chain genes coding for
identical chains in the human genome.

Campbell et al. (1973) and Hall and Hartl (1974) reported experiments with
Escherichia coli in which mutant strains with a deletion of the
β-galactosidase gene (lac Z) reacquired the ability to hydrolyze
β-galactosides during prolonged intense selection for growth on lactose.
Clearly, a new gene for β-galactosidase evolved. This new gene was shown to
be located almost exactly opposite the location of the ordinary
β-galactosidase gene (the lactose operon) in the circular linkage map of
E. coli. It is not known which gene of the original lac deletion strain has
been developed into the new β-galactosidase gene, but it is probable that
the new gene is evolutionarily homologous to the ordinary β-galactosidase
gene.

2)  Gene  elongation 

Like hemoglobin, haptoglobin is composed of two α-chains and two β-chains.
There are two types of α-chains in human haptoglobin, α¹ and α².
Furthermore, two forms of haptoglobin α¹ are known, called fast (F) and
slow (S). The difference between these two forms is attributable to the
amino acid at position 54, lysine (F) and glutamic acid (S). Studies on
amino acid sequences have shown that the α² chain (143 amino acids) is
nearly twice as long as the α¹ chain (84 amino acids) and consists of
portions of the F and S forms of the α¹ chain. Thus, it is clear that the
α² gene is a product of unequal crossing over within a gene, which occurred
between the F and S allelic genes in a heterozygote. Since the α² gene is
apparently present only in man and no amino acid difference is observed
between the homologous parts of the α¹- and α²-chains, the unequal crossing
over must have occurred very recently. The α¹ and α² genes behave as
alleles and the frequency of α² is 30 to 70 percent in human populations.
Black and Dixon (1968) have suggested that the α²-chain may have a selective
advantage over the α¹-chain, since it is more efficient than the α¹-chain
in rendering the heme group susceptible to degradation. At any rate, if the
α² gene replaces the α¹ gene, man will have a longer gene for the α-chain
than other organisms. Similar examples of gene elongation are observed in
bacterial ferredoxin, bacterial cytochrome c3, vertebrate immunoglobulin
γ-chain, and lima bean protease inhibitor (see Dayhoff, 1972).

3)  Hybrid  genes 

Gene duplication by unequal crossing over may occur in a DNA region
including two genes. This may produce a new gene which consists of parts
of two consecutive genes. A good example of this type of new gene is the
Lepore hemoglobin gene in man. This gene is composed of parts of the β-
and δ-chain genes (Baglioni, 1962). This type of unequal crossing over seems
to occur rather frequently, since there are already 11 different types of
Lepore hemoglobins reported. This high frequency of unequal crossing over in
the β and δ gene region is of course attributable to the close linkage of
the β and δ genes, the latter itself being a product of unequal crossing
over. It has long been known from the study of the Bar locus in Drosophila
that the duplicate gene region is very unstable, probably because the
homologies both between and within genes disturb chromosomal (DNA) pairing
in meiosis.

In practice, however, such hybrid genes as the above seem to have some
deleterious effect, unless the original genes are retained together with the
hybrid genes. Thus, the Lepore hemoglobin genes are kept in low frequency.
On the other hand, if the original genes are retained, the hybrid gene may
evolve into a new gene. One such example is the clupeine Z gene in herring,
which probably arose through a crossing over between the clupeine YI and
YII genes. Fitch (1971a) has shown that the probability that these three
genes arose by simple duplications and subsequent amino acid substitution
is very small. It is also possible that the βA-chain of sheep hemoglobin was
produced by unequal crossing over between the βB and βC genes.

Fig. 8.2. Spectrogram of the frequency of repetition of nucleotide sequences in the DNA
of the mouse. Relative quantity of DNA plotted against the logarithm of the repetition
frequency. The dashed segments of the curve represent regions of considerable uncertainty.
From Britten and Kohne (1968), reprinted by permission, The American Association for
the Advancement of Science, © 1968.

8.1.4 Repeated DNA

Recent studies of DNA chemistry have shown that the genome of higher
organisms contains various classes of highly repeated DNA. This was first
discovered by Waring and Britten (1966) in an investigation of denaturation
and reassociation of DNA molecules from the house mouse. Studying the
speed of DNA reassociation, they concluded that the mouse DNA contains
a short nucleotide sequence (about 300 base pairs long) present in about one
million copies. Later, Britten and Kohne (1968) showed that virtually all
eukaryotic organisms contain a fraction of repeated DNA. This repeated
DNA is sometimes called satellite DNA, since it often forms a satellite
band when the total DNA is fractionated on the basis of nucleotide
composition by CsCl centrifugation. The total amount of repeated DNA in
a genome varies with organism but constitutes 5 to 60 percent of the total
DNA. The repeated DNA generally comprises many different sets of
multiple copies of nucleotide sequences, as shown in fig. 8.2. The number
of multiple copies of a nucleotide sequence also varies with organism. The
number of copies of a particular sequence seems to be generally 1000 to
100,000. The length of the basic unit of such repeated sequences varies with
the DNA class. In the case of repetitive DNA's in guinea pig, the basic
unit of one of the two strands seems to be a sequence of six nucleotides
(C-C-C-T-A-A and its slight modifications) (Southern, 1970). Note that the
sequences of the repeats of such DNA's are generally not identical, though
all repeats have very similar sequences.

As  will  be  seen  from  fig.  8.2,  separation  of  repeated  and  nonrepeated  DNA 
is  clearly  arbitrary.  If  we  note  that  in  the  evolutionary  process  there  occurred 
a large  number  of  gene  duplications  in  the  genome  of  higher  organisms  and 
that  the  rate  of  nucleotide  substitution  in  evolution  is  very  slow,  it  is  expected 
that  the  experimentally  isolated  nonrepeated  DNA  also  includes  a substantial 
number  of  duplicate  genes. 

The biological functions of repeated DNA's are virtually unknown at the
present time. A certain proportion of repeated DNA's is accounted for by the
genes for ribosomal and transfer RNA's, but the total amount of repeated
DNA is much larger than that required for producing these RNA's. Some
types of repetitive DNA's in mammals, including man, are apparently
transcribed (Saunders, 1974), but it is generally believed that a majority of
repeated DNA's are not used as structural genes. In fact, the highly
repetitious DNA's in mouse and guinea pig do not appear to be transcribed
(Flamm et al., 1969; Southern, 1970). This DNA is generally concentrated in
the heterochromatic regions (mostly the centromere and nucleolar organizer
regions) of chromosomes, but some parts are apparently interspersed
throughout the euchromatic regions. Britten and Davidson (1969) speculated
that repeated DNA plays an important role in the regulation of gene
function, but no evidence seems to have been obtained. Yunis and Yasmineh
(1971), on the other hand, proposed that it functions as a structural
component ('spacer DNA') of vital regions of chromosomes and protects these
regions from destructive chromosomal changes. While their arguments are not
very convincing (at least to me), the recent study by Brown (1973) and his
colleagues indicates that the spacer DNA's in the ribosomal RNA gene region
in the African clawed toads Xenopus are highly repetitious. This region of
DNA consists of about 450 repeating units, each of which includes three
major sequences: a gene for the 18S RNA, a gene for the 28S RNA, and a
'spacer' DNA that is not transcribed into RNA. (In addition to these, there
are two small pieces of spacer DNA in each repeat that are transcribed but
eliminated in the cleaving process.) The nucleotide sequence of the gene for
each of the two types of RNA is the same for all repeats. The nucleotide
sequences of spacer DNA are also very similar though not identical.

The evolution of repeated DNA remains somewhat mysterious. Certain
families of repeated DNA such as those for ribosomal and transfer RNA are
apparently the product of repeated duplication, which enabled higher
organisms to synthesize a large quantity of gene products. A large part of
repeated DNA, however, does not appear to have any vital function.
The families of repeated DNA range from groups of almost identical
sequences to those with divergent sequences. From this observation, Britten
and Kohne (1968) have suggested that repeated DNA arises from large-scale
precise duplication of selected sequences and then undergoes divergence due
to mutation, deletion, and insertion of nucleotide pairs. According to them,
the large-scale gene duplication occurs rather rapidly, since the sequences
of repeated DNA are generally very similar within the same species but quite
different even between closely related species. Britten and Kohne called this
sort of large-scale gene duplication saltatory replication, but gave no
explanation of how it really occurs. If repeated DNA has no vital biological
function, how can a piece of DNA about 300 bases long be multiplied 100
to 100,000 times in a relatively short period of evolutionary time?

Most molecular biologists (e.g. Britten and Kohne, 1968; Walker, 1971)
seem to believe that repeated DNA has spread through the population of a
species because it conferred some selective advantage on the individual
which carries it. This is of course not necessarily true. Repeated DNA can
be fixed in a population purely by random genetic drift, even if it has no
selective advantage. It is therefore possible that at least some families of
repeated DNA have been derived from already nonfunctionalized genes (Nei
and Roychoudhury, 1973b). Such nonfunctional and selectively neutral DNA
may be multiplied hundreds or thousands of times by unequal crossing over.
As indicated by Flamm (1972), only about 25 rounds of reduplication would be
required to produce 30 million copies from a single nucleotide sequence, if
each duplication doubles the number of copies.
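
Flamm's figure follows directly from repeated doubling, as the short sketch below shows. It assumes, as the text does, that every round doubles the number of copies; the function name is hypothetical.

```python
def doublings_needed(copies: int) -> int:
    """Smallest number of doublings that turns one copy into at least `copies`."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # For a positive integer c, (c - 1).bit_length() equals ceil(log2(c)).
    return (copies - 1).bit_length()


rounds = doublings_needed(30_000_000)
print(f"rounds of reduplication: {rounds}")
print(f"copies after that many rounds: {2 ** rounds:,}")
```

Integer arithmetic is used instead of a floating-point logarithm so that exact powers of two are handled without rounding error.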

However, a recent study by Brown (1973) and his colleagues on the
ribosomal RNA gene region in Xenopus laevis and X. mulleri has made this
question more difficult to answer. As mentioned earlier, this region consists
of a series of repeats of the RNA genes and spacer genes. Brown and his
colleagues have shown that the nucleotide sequences of spacer DNA are
virtually the same within a species but different between X. laevis and
X. mulleri. (About 10 percent of the nucleotides are different.) If the
spacer DNA's in the two species have been derived from the same spacers in
their common ancestor, we would expect the spacer sequences in different
repeats of the same species to be differentiated to the same degree as those
between the species. The explanation becomes harder when we note that the
nucleotide sequences of the 18S and 28S RNA genes are very similar even
among distantly related organisms. The genes that code for ribosomal RNA
in higher plants are closer in sequence to those in Xenopus than the spacer
sequences of X. laevis are to those of X. mulleri.

There are two ways to explain Brown's observations. One is to assume
that the spacer DNA in all repeats is occasionally replaced by duplicate
copies of a single sequence. The other is to use Callan's (1967) hypothesis
of master-slave DNA and assume that only one repeat of the 18S, 28S, and
spacer genes is transmitted from generation to generation and all other
repeats are slave DNA. Neither of these two hypotheses has any experimental
support. It should be noted that the master-slave gene hypothesis does not
apply to all families of repeated DNA, since some families clearly consist
of multiple copies of similar but slightly differentiated sequences. In the
master-slave gene hypothesis, multiple copies of identical sequence are
expected to be produced.

8.1.5 Nonfunctional DNA

As already mentioned, a large part of highly repeated DNA is apparently
nonfunctional in the sense that it is not transcribed into any RNA. The
nonfunctionality of a part of duplicate genes can be explained by the
accumulation of deleterious mutations. As was first indicated by Haldane
(1933), if there are two or more identical genes in the genome, all the
genes except one may become nonfunctional if one gene is able to produce the
necessary quantity of gene product. Nei (1969a) postulated that a large
number of nonfunctional genes have accumulated in higher organisms, since
gene duplication must have occurred many times in the evolutionary process.
From the genetic load argument, Ohta and Kimura (1971a) estimated that
more than 90 percent of the DNA in the mammalian genome is nonfunctional.
Crick (1971) speculated that in Drosophila the structural genes reside in the
interband regions of salivary chromosomes which contain about 5 percent
of the total genome. The RNA-DNA hybridization experiment by Turner
and Laird (1973), however, suggests that at least 24 percent of the total
DNA is transcribable. The exact proportion of functional DNA in higher
organisms still remains to be determined.

Nei's argument is based on a simple mathematical computation. Namely,
a lethal or nonfunctional mutation occurring in one of the duplicate loci
would be harmless and behave as a neutral or near-neutral gene in
populations, as long as the other duplicate gene or genes function normally.
The rate of fixation of such mutations in relatively small populations is
therefore equal to the mutation rate (ch. 5). Since the rate of lethal
mutation per generation is roughly $10^{-5}$ per locus, a considerable
number of genes are expected to become nonfunctional if there are many
duplicate genes. This argument does not hold if population size is very large
(Fisher, 1935), but a more detailed study of this problem has shown that if
the effective population size is less than 2000, the accumulation of
nonfunctional genes is substantial (Nei and Roychoudhury, 1973b). We note
that the effective size to be used for deleterious genes is that of a local
population (Nei, 1968), while the effective size for neutral genes is that of
the whole species when migration occurs among local populations (Kimura and
Maruyama, 1971). In the evolutionary process, some duplicate genes would
certainly acquire a new function by mutation. However, the probability of
such events seems to be very small, since mutation is a random process.
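
As a rough illustration of this argument, suppose that silencing mutations at a redundant duplicate locus are fixed at a constant rate equal to the mutation rate, about $10^{-5}$ per locus per generation. Treating fixation as a simple Poisson process is an assumption of this sketch, not a result from the text; it ignores the time needed for fixation, any selection, and the dependence on population size discussed above. Under that assumption the chance that a given duplicate has been silenced after $t$ generations is $1 - e^{-ut}$. The function name is hypothetical.

```python
import math


def fraction_silenced(mutation_rate: float, generations: float) -> float:
    """Probability that a redundant locus has fixed a silencing mutation.

    Assumes fixations arrive as a Poisson process whose rate per generation
    equals the mutation rate (the neutral case in a small population).
    """
    if mutation_rate < 0 or generations < 0:
        raise ValueError("rate and generations must be non-negative")
    return 1.0 - math.exp(-mutation_rate * generations)


LETHAL_RATE = 1e-5   # lethal mutations per locus per generation (from the text)

for generations in (1e4, 1e5, 1e6):
    share = fraction_silenced(LETHAL_RATE, generations)
    print(f"{generations:>9.0f} generations: {share:.3f} of duplicates silenced")
```

On these assumptions most redundant duplicates would be lost within a few hundred thousand generations, which is short on an evolutionary time scale.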

One might wonder why there are so many functional duplicate genes for
ribosomal or transfer RNA if the above hypothesis is correct. The reason
seems to be that a large quantity of ribosomal and transfer RNA is required
for protein synthesis. If lethal mutations occur at some of these loci, they
are expected to reduce the fitness of heterozygotes, so that they will
quickly be eliminated from the population. In fact, the probability of
fixation of nonfunctional genes at duplicate gene loci decreases considerably
if these genes reduce the heterozygote fitness even to a small extent.

It has long been known that the Y chromosome in most organisms lacks
functional genes except for some special kinds of genes such as those for
sex determination, male fertility, and ribosomal RNA (Stern, 1929; Ritossa
and Spiegelman, 1965; Mittwoch, 1967; Hess and Meyer, 1968). The Y
chromosome is generally heterochromatic but devoid of so-called repeated
DNA (Yunis and Yasmineh, 1971), though in some organisms the presence of
repeated DNA is suspected (Blumenfeld and Forrest, 1971). Muller (1914)
seems to be the first to postulate that the inactivation of the Y chromosome
is the result of accumulation of lethal genes. He argued that the gene loci
on the Y chromosome are always kept heterozygous, so that any lethal
mutations occurring at these loci are sheltered by the wild-type allele at
the homologous loci on the X chromosome, while the lethal mutations
occurring on the X chromosome are eliminated in the homogametic sex, where
the lethal mutations may become homozygous. This argument was once rejected
by Fisher (1935), who showed that the probability of accumulation of lethal
genes on the Y chromosome is extremely small in large populations. Recently,
however, Nei (1970) showed that the probability is not small in populations
of relatively small effective size (roughly less than 2000) and argued that
the inactivation of the Y chromosome has probably occurred according to the
scheme proposed by Muller. Experimental support of Muller's hypothesis
has been provided by Kidwell (1972). She studied the fixation of lethal genes
in the Glued-Stubble region (16.8 centimorgans) of the third chromosome
which had been kept heterozygous (Gl-Sb/++) in populations of sizes
8 to 48. These populations were originally started to study the effectiveness
of natural selection for reduced recombination. Tests of lethal genes
revealed that at least one lethal gene was fixed on the non-Gl-Sb chromosome
in five of the 10 populations studied within 60 generations. Lethal genes
fixed on the Gl-Sb chromosome could not be detected because Gl and Sb are
homozygous lethal.

Muller's idea on the accumulation of lethal genes on sheltered chromosomes
applies also to the chromosomes in asexual and parthenogenetic
organisms, if they are diploid or polyploid. Since these organisms undergo
no segregation and recombination, all alleles at a locus except one may
become nonfunctional. Another example of sheltered chromosomes is the
translocation chromosomes in Oenothera which are kept heterozygous
permanently. In this organism lethal genes have already been accumulated,
so that homozygotes for translocations can no longer survive.

8.2  Nucleotide  substitution  in  DNA 

8.2.1  Some  theoretical  backgrounds 

In  the  foregoing  sections  we  were  mainly  concerned  with  the  evolutionary 
change  of  the  DNA  content.  Another  important  change  of  DNA  in  evolution 
is  the  substitution  of  nucleotide  pairs. 

In modeling the nucleotide substitution in evolution, we assume that the
substitution occurs at any nucleotide site with equal probability during a
given evolutionary time, and at each site a given nucleotide mutates with
equal probability to any one of the remaining three. Let $i_t$ be the
probability of identity of nucleotides at a given site between two
homologous cistrons at time $t$ (measured in years) after the divergence, and
$\lambda_b$ be the probability of nucleotide substitution per base per year.
Then, we have the following recurrence equation

$$i_{t+1} = \left\{(1 - \lambda_b)^2 + \lambda_b^2 \times \tfrac{1}{3}\right\} i_t
+ \left\{2\lambda_b(1 - \lambda_b) \times \tfrac{1}{3} + \lambda_b^2 \times \tfrac{2}{9}\right\}(1 - i_t).$$

The value of $\lambda_b$ is very small, so that the terms involving
$\lambda_b^2$ can be neglected. If we replace $i_{t+1} - i_t$ by $di/dt$, then

$$\frac{di}{dt} = \frac{2}{3}\lambda_b - \frac{8}{3}\lambda_b i.$$

Solution of the above equation with the initial condition $i_0 = 1$ gives

$$i_t = 1 - \frac{3}{4}\left[1 - e^{-8\lambda_b t/3}\right] \tag{8.2}$$

(Nei and Chakraborty, unpublished). The expected number of nucleotide
substitutions per base ($\delta_b$) is $2\lambda_b t$, so that it can be
estimated by

$$\delta_b = -\frac{3}{4}\log_e\left(1 - \frac{4}{3}\pi\right), \tag{8.3}$$

where $\pi = 1 - i$ is the proportion of different nucleotides between the two
homologous cistrons. The above formula is identical to that obtained by
Kimura and Ohta (1972a) using a different method (see also Jukes and
Cantor, 1969). Clearly, the number of nucleotide substitutions per codon is

$$\delta_c = 3\delta_b. \tag{8.4}$$

Holmquist (1972a, b) studied the relation of the proportion of different
amino acids between two homologous polypeptides ($p_{aa}$) to the proportion of
different nucleotides between the corresponding cistrons ($\pi = 1 - i$) by
using the property of the genetic code. Kimura and Ohta (1972a) showed
that Holmquist's relationship can be approximated by

$$p_{aa} = 1 - (1 - \pi)^2 (1 - \pi/4). \tag{8.5}$$

This formula is derived by noting that the probability that two homologous
codons code for the same amino acid is

$$1 - p_{aa} = (1 - \pi)^2 \left\{(1 - \pi) + \tfrac{3}{4}\pi\right\}.$$

This is because $(1 - \pi)^2$ represents the probability that the two codons are
the same with respect to the first two positions, while $(1 - \pi)$ and $3\pi/4$ in
the braces give respectively the probability that the third position is the same
and the probability that the third position is different but codes for the
same amino acid. The last mentioned probability, i.e. $3\pi/4$, is an approximation
based on the property of the genetic code (table 3.1).
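The relations (8.2) to (8.5) can be checked numerically. The following Python sketch iterates the recurrence for the identity probability, compares it with the closed-form solution, and then evaluates the estimators. The function names and the toy rate are illustrative assumptions and are not taken from the original treatment; the per-step rate is deliberately exaggerated so that the loop stays short.

```python
import math


def identity_by_recurrence(lam_b, steps):
    """Iterate the recurrence for the identity probability, starting from i_0 = 1."""
    i = 1.0
    for _ in range(steps):
        # An identical pair stays identical: neither site changes, or both
        # change to the same nucleotide.
        stay = (1 - lam_b) ** 2 + lam_b ** 2 / 3
        # A different pair becomes identical: one site changes to match the
        # other, or both change to the same third nucleotide.
        converge = 2 * lam_b * (1 - lam_b) / 3 + 2 * lam_b ** 2 / 9
        i = stay * i + converge * (1 - i)
    return i


def identity_closed_form(lam_b, t):
    """Closed-form identity probability, equation (8.2)."""
    return 1 - 0.75 * (1 - math.exp(-8 * lam_b * t / 3))


def substitutions_per_base(pi):
    """Estimated substitutions per base, equation (8.3)."""
    if not 0 <= pi < 0.75:
        raise ValueError("pi must lie in [0, 0.75) for the logarithm to exist")
    return -0.75 * math.log(1 - 4 * pi / 3)


def substitutions_per_codon(pi):
    """Estimated substitutions per codon, equation (8.4)."""
    return 3 * substitutions_per_base(pi)


def amino_acid_difference(pi):
    """Approximate proportion of different amino acids, equation (8.5)."""
    return 1 - (1 - pi) ** 2 * (1 - pi / 4)


if __name__ == "__main__":
    lam_b, t = 1e-3, 1000  # toy values, chosen only to keep the loop short
    print(identity_by_recurrence(lam_b, t), identity_closed_form(lam_b, t))
    pi = 1 - identity_closed_form(lam_b, t)
    print(substitutions_per_base(pi), 2 * lam_b * t)
    print(substitutions_per_codon(pi), amino_acid_difference(pi))
```

With a small rate the two identity values should agree closely, and `substitutions_per_base` should recover $2\lambda_b t$ from the proportion of different nucleotides.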

The relationship among $\pi$, $p_{aa}$, and $\delta_c$ is tabulated by Kimura and Ohta
(1972a). Formulae (8.4) and (8.5) are useful when $\pi$ is large. In general,
however, $\pi$ is very small compared with unity. In this case we have

$$p_{aa} = \tfrac{9}{4}\pi, \tag{8.6}$$

$$\delta_c = 3\pi \tag{8.7}$$

approximately. From (8.6), it is clear that the rate of amino acid substitutions
($\lambda$) is related to the rate of nucleotide substitutions ($\lambda_b$) by

$$\lambda_b = (4/9)\lambda. \tag{8.8}$$
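To see how good the small-difference approximations are, the sketch below compares the codon-based relation (8.5) with the linear form (8.6) and applies the conversion factor of (8.8). It is a minimal illustration; the function names and the sample values of the nucleotide difference are assumptions made for this example.

```python
def amino_acid_difference_exact(pi):
    """Equation (8.5): p_aa = 1 - (1 - pi)^2 (1 - pi/4)."""
    return 1 - (1 - pi) ** 2 * (1 - pi / 4)


def amino_acid_difference_approx(pi):
    """Equation (8.6): p_aa is about (9/4) pi when pi is small."""
    return 9 * pi / 4


def nucleotide_rate_from_amino_acid_rate(lam):
    """Equation (8.8): lambda_b = (4/9) lambda."""
    return 4 * lam / 9


if __name__ == "__main__":
    for pi in (0.01, 0.05, 0.20):  # illustrative values only
        exact = amino_acid_difference_exact(pi)
        approx = amino_acid_difference_approx(pi)
        print(f"pi={pi:.2f}  exact={exact:.4f}  approx={approx:.4f}")
```

The gap between the two columns widens as the nucleotide difference grows, which is why the linear forms are reserved for small differences.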

In the above formulations we have assumed that $\lambda_b$ is the same for all
bases in a cistron. This assumption is clearly incorrect, since the functional
requirement of proteins often prohibits nucleotide substitutions at certain
positions. A good example is the codons for active sites of proteins, where
amino acid substitutions occur very rarely. If $\lambda_b$ varies from site to site,
(8.3) gives an underestimate of $2\lambda_b t$, as in the case of estimation of genetic
distance (7.9). If the variance of $2\lambda_b t$ is known, a correction for this factor
can be made. At the present time, however, we do not have good estimates
of the variance of $\lambda_b$ or $\lambda$.

8.2.2  DNA  hybridization 

As  mentioned  earlier,  the  chemical  determination  of  nucleotide  sequence  in 
DNA  is  very  expensive  and  time-consuming.  If  the  sequence  could  be 
determined  at  the  rate  of  1 base  per  second,  it  would  require  4 months  to 
sequence  a bacterial  genome  and  over  100  years  to  sequence  one  mammalian 
DNA  (Hoyer  and  Roberts,  1967).  In  evolutionary  studies  it  is  often  important 
to  know  the  overall  difference  between  DNA's  from  two  different  species. 
For this purpose the DNA hybridization technique can be used, though it is
quite crude at the present time. It has already provided some interesting
results  about  the  evolutionary  change  of  DNA.  Recent  reviews  on  this 
subject  have  been  published  by  Kohne  (1970)  and  Kohne  et  al.  (1972). 

The  basic  procedure  of  this  technique  is  as  follows:  1)  Denature  the  DNA 
molecules  from  the  two  species  under  investigation  into  single  strands, 
2)  hybridize  the  single  strands  of  DNA  from  one  species  with  those  of  the 
homologous  DNA  from  the  other  to  make  double-strand  DNA,  and  3) 
measure the thermal stability of the hybrid DNA. It is known that double-strand
DNA, when heated, dissociates into single strands, and this dissociation
occurs at a lower temperature when there is any mismatch between the
bases of the two strands than when all the bases are completely matched. It
has  been  shown  that  about  1.5  percent  base-pair  mismatches  lower  thermal 
stability  by  1 °C  when  the  stability  is  measured  with  the  temperature  at 
which  50  percent  dissociation  of  the  hybrid  DNA  occurs  (see  Kohne  et  al., 
1972).  Therefore,  the  proportion  of  different  bases  between  DNA's  of  the 
two  species  may  be  determined  by  measuring  thermal  stability.  Note  that  the 
DNA  in  higher  organisms  is  quite  heterogeneous  and  there  are  several 
technical  problems  which  make  it  difficult  to  estimate  the  proportion  of 
different  bases  (McCarthy  and  Farquhar,  1972). 
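The conversion from thermal stability to base mismatch described above can be written as a one-line calculation. The sketch below assumes the figure quoted in the text (about 1.5 percent mismatch per 1 °C lowering of the 50 percent dissociation temperature) and a strictly linear relation; the function name and the example temperatures are hypothetical.

```python
# Figure quoted in the text: about 1.5 percent mismatched base pairs
# lower the 50 percent dissociation temperature by 1 degree C.
MISMATCH_PERCENT_PER_DEGREE = 1.5


def mismatch_percent(tm_reference_c, tm_hybrid_c):
    """Estimate percent mismatched base pairs from the drop in thermal stability.

    tm_reference_c: 50 percent dissociation temperature of perfectly matched DNA.
    tm_hybrid_c: 50 percent dissociation temperature of the interspecies hybrid.
    Linearity over the whole range is an assumption of this sketch.
    """
    delta = tm_reference_c - tm_hybrid_c
    if delta < 0:
        raise ValueError("hybrid DNA is not expected to be more stable than the reference")
    return MISMATCH_PERCENT_PER_DEGREE * delta


if __name__ == "__main__":
    # Hypothetical temperatures, for illustration only.
    print(mismatch_percent(85.0, 83.0))
```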

As mentioned earlier, the DNA of higher organisms includes a large
amount of repeated DNA. Since the evolutionary scheme of this class of
DNA is not well known, it is generally eliminated from the total DNA and
only the nonrepeated DNA is used in the test of hybridization. In practice,
however, separation of repeated and nonrepeated DNA's is somewhat
arbitrary, and even the so-called nonrepeated DNA is expected to include a
substantial amount of genes of low duplications. If this is the case, the rate
of nucleotide substitution determined from DNA hybridization is expected
to be an overestimate. Another difficulty is that the proportion of nonrepeated
DNA varies with organism, and thus it is not always clear whether
the same classes of genes are used or not when different pairs of species are
compared.


Table 8.3

Rates of nucleotide substitution estimated from DNA hybridization experiments. From
Kohne et al. (1972).

DNA's compared      Nucleotide    Years after     Rate of change   Generation   Rate of change
                    differences   divergence      per year         time         per generation
                    (%)           x 2             x 10^9 *         (years)      x 10^9 *

Man-Chimp            2.5           3 x 10^7 **     0.8             10           8
Man-Gibbon           5.1           6 x 10^7 **     0.8             10           8
Man-Green Monkey     9.0           9 x 10^7        1.0             2-4          3
Man-Rhesus           8.3           9 x 10^7        0.9             2-4          2.7
Man-Capuchin        15.8          13 x 10^7        1.2             2-4          3.6
Man-Galago          42            16 x 10^7        2.6             1-2          3.9
Mouse-Rat           30             2 x 10^7       15.0             0.33         5
Cow-Sheep           11.2           5 x 10^7        2.2             1-2          3.3

*  The Poisson correction has not been made.
** This divergence time has been disputed and could be smaller than this figure.

Despite  these  difficulties,  this  method  has  been  used  by  several  authors  in 
measuring  nucleotide  differences  among  various  organisms.  Table  8.3  shows 
the  results  obtained  with  some  mammalian  species,  mostly  primates  (Kohne 
et  al.,  1972).  It  is  clear  that  the  nucleotide  differences  between  species  are 
larger  when  the  species  to  be  compared  are  remotely  related  than  when  they 
are  closely  related.  Thus,  the  proportion  of  different  nucleotide  pairs  is  2.5 
percent  between  man  and  chimpanzee,  while  it  is  42  percent  between  man 
and  galago.  Nevertheless,  the  proportion  of  different  nucleotide  pairs  is  not 
necessarily proportional to the time after divergence of species in chronological
years. Particularly noteworthy is a high rate of nucleotide substitution
in mouse and rat. From this result, Kohne et al. (1972) argued that
the rate of gene substitution has been slowed down in the primate groups.
They  state  that  the  rate  of  nucleotide  substitution  is  affected  by  generation 
time  and  it  becomes  roughly  constant  if  time  is  measured  in  generations. 
However,  McConaughy  and  McCarthy’s  (see  McCarthy  and  Farquhar, 
1972)  estimate  of  different  nucleotides  between  mouse  and  rat  is  9 percent 
rather  than  30  percent.  If  we  take  this  estimate,  the  gene  divergence  becomes 
roughly  proportional  to  the  divergence  time  measured  in  years.  At  any  rate, 
the  present  data  from  DNA  hybridization  tests  appear  to  be  subject  to 
considerable  error. 
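The rates discussed above follow from simple division, as the following sketch shows. It uses the nucleotide differences quoted in the text (2.5 percent for man and chimpanzee, 30 or 9 percent for mouse and rat) and the doubled divergence times as read from table 8.3; the function names are assumptions, and no Poisson correction is applied, as in the table.

```python
def rate_per_year(percent_difference, twice_divergence_years):
    """Nucleotide substitutions per site per year, without Poisson correction."""
    return (percent_difference / 100.0) / twice_divergence_years


def rate_per_generation(percent_difference, twice_divergence_years, generation_years):
    """The same rate re-expressed per generation."""
    return rate_per_year(percent_difference, twice_divergence_years) * generation_years


if __name__ == "__main__":
    man_chimp = rate_per_year(2.5, 3e7)
    mouse_rat_kohne = rate_per_year(30, 2e7)     # estimate used by Kohne et al.
    mouse_rat_mccarthy = rate_per_year(9, 2e7)   # McConaughy and McCarthy's estimate
    print(man_chimp, mouse_rat_kohne, mouse_rat_mccarthy)
    print(rate_per_generation(2.5, 3e7, 10))     # generation time of 10 years for man-chimp
```

Swapping the mouse-rat difference from 30 to 9 percent changes the per-year rate by more than a factor of three, which illustrates how sensitive the generation-time argument is to a single measurement.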

In ch. 7 it was mentioned that the electrophoretically detectable codon
differences between man and chimpanzee are 0.62 per locus. If only one
fourth of amino acid differences can be detected by electrophoresis, the
number of amino acid differences between human and chimpanzee proteins
is estimated to be 2.5 per polypeptide. The polypeptides used in this experiment
had about 300 amino acids on the average (M. King, 1973). Therefore,
the genetic distance 0.62 corresponds to about one codon difference per 100
codons. The expected nucleotide differences are then (4/9) × 1 or roughly 0.5
percent from (8.6). This value is about one-fifth of the estimate from DNA
hybridization (2.5 percent). The estimate of nucleotide differences between
man and Rhesus monkey can be compared with that obtained from amino acid
sequences of hemoglobin α- and β-chains. The total number of amino acid
differences in these two chains is 12, while the total number of amino acids
involved is 287. Therefore, the proportion of different amino acids
is 4.2 percent. From (2.3), $\delta = 2\lambda t$ is estimated to be 0.043. Thus, the estimate
of nucleotide differences per base pair is about 2 percent. This value is about
one-fourth of the estimate given in table 8.3. If we note that the rate of
nucleotide substitution in hemoglobin is close to the average for various
proteins, this indicates that the nucleotide differences estimated from DNA
hybridization are much higher than those obtained from amino acid
sequences, as indicated by Laird et al. (1969).
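The two comparisons in this paragraph can be retraced step by step. The sketch below is a minimal illustration: it assumes that formula (2.3), which is not reproduced in this section, is the usual Poisson correction $\delta = -\log_e(1 - p_{aa})$ (this reproduces the quoted 0.043), and it uses the factor 4/9 relating amino acid and nucleotide differences. All variable and function names are hypothetical.

```python
import math


def nucleotide_difference_from_amino_acid(p_aa):
    """Small-difference relation between amino acid and nucleotide differences (factor 4/9)."""
    return 4 * p_aa / 9


def poisson_corrected_substitutions(p_aa):
    """Assumed form of formula (2.3): delta = -log_e(1 - p_aa)."""
    return -math.log(1 - p_aa)


# Man and chimpanzee: electrophoretic data.
detectable_per_locus = 0.62
fraction_detectable = 0.25
codons_per_polypeptide = 300
aa_differences = detectable_per_locus / fraction_detectable       # about 2.5 per polypeptide
aa_per_codon = aa_differences / codons_per_polypeptide            # the text rounds this to 1 per 100
nucleotide_chimp = nucleotide_difference_from_amino_acid(aa_per_codon)

# Man and Rhesus monkey: hemoglobin alpha- and beta-chains.
p_aa_hemoglobin = 12 / 287
delta_hemoglobin = poisson_corrected_substitutions(p_aa_hemoglobin)
nucleotide_rhesus = nucleotide_difference_from_amino_acid(delta_hemoglobin)

if __name__ == "__main__":
    print(aa_differences, aa_per_codon, nucleotide_chimp)
    print(p_aa_hemoglobin, delta_hemoglobin, nucleotide_rhesus)
```

Without the rounding to one codon difference per 100 codons, the man-chimpanzee figure comes out somewhat below the 0.5 percent quoted in the text, which only strengthens the contrast with the hybridization estimate.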

The discrepancy between data from DNA hybridization and protein
differences can be explained in several different ways. 1) Effect of duplicate
genes coding for similar polypeptides, such as hemoglobin β- and δ-chain
genes or two γ-chain genes in man. 2) Inclusion of spacer DNA in the test
of DNA hybridization. As discussed earlier, spacer DNA evolves much
faster than structural DNA. Since protein data do not represent spacer
DNA, the nucleotide differences estimated from protein data would be
smaller than those from DNA hybridization. 3) Technical difficulties in
DNA hybridization (McCarthy and Farquhar, 1972). 4) Mutations at the
third positions in codons usually do not affect protein structure, and the
rate of nucleotide substitution at these positions may be higher than at the
other positions (King and Jukes, 1969).


Table 8.4

Amino acid differences (%) in cytochrome c and c2 between different organisms. The number of
positions compared varies with the pair of organisms. All positions are used in a computation
except those in which both sequences have a gap. Cytochrome c2 in bacteria is known to
be homologous with cytochrome c in eukaryotes. From Dayhoff (1972).


8.3  Amino  acid  substitution  in  proteins 

8.3.1 Rate of amino acid substitution

In  ch.  2 we  have  seen  that  the  property  of  constant  rate  of  amino  acid 
substitution  can  be  used  for  constructing  phylogenetic  trees.  This  property 
was  first  noted  by  Zuckerkandl  and  Pauling  (1962)  and  Margoliash  (1963) 
in  their  comparative  studies  on  amino  acid  sequences  of  hemoglobin  and 
cytochrome c. Later, this was confirmed in more extensive studies by Zuckerkandl
and Pauling (1965) and Margoliash and Smith (1965). Let us now study
this  property  in  more  detail. 

The  proteins  of  which  the  amino  acid  sequences  have  been  studied  most 
extensively  are  cytochrome  c,  hemoglobin,  and  fibrinopeptides.  Table  8.4 
shows  the  amino  acid  differences  among  the  cytochrome  c sequences  from 
diverse  organisms.  It  is  clear,  as  in  the  case  of  hemoglobin  data  (table  2.2), 
that  the  cytochromes  c from  closely  related  organisms  are  more  similar  than 
those  from  distantly  related  organisms.  The  similarity  is  such  that  the  difference 
between  any  two  organisms  depends  almost  entirely  on  the  time  after 
divergence.  For  example,  the  difference  between  bacterial  cytochrome  c2 
(this  is  homologous  to  cytochrome  c in  eukaryotes)  and  cytochrome  c of 
any  other  (higher)  organism  is  virtually  the  same  (62  ~ 72  percent),  whether 
this  is  plant  or  animal.  Similarly,  the  cytochrome  c in  the  fungi  and  yeast 
groups  is  almost  equally  related  with  any  other  higher  organism,  the  amino 
acid  difference  being  41  to  50  percent.  A similar  dependence  of  amino  acid 
differences  on  the  divergence  time  can  be  seen  in  almost  all  proteins  so  far 
studied  (Dayhoff,  1972). 

Dickerson (1971) studied the relationship between the accumulated number
of amino acid substitutions and divergence time in cytochrome c, hemoglobins,
and fibrinopeptides A and B by using formula (2.3). The results obtained are
given in fig. 8.3. The data for hemoglobin include not only those of the
α-, β-, γ-, and δ-chains but also those of the lamprey globin and sperm whale
myoglobin. As was mentioned in section 8.1, all these polypeptides are
evolutionarily homologous and the rates of amino acid substitutions are
more or less the same. It is seen that the accumulated number of amino acid
substitutions per codon in evolution increases approximately linearly with
increasing divergence time in each protein. There is, however, a striking
difference in the rate of substitution among different proteins. The rate for
hemoglobin is about three times higher than that for cytochrome c but about
three times lower than that for fibrinopeptides. Such differences are also
observed in other proteins such as insulin, ribonuclease, and immunoglobulin,
though the number of sequences determined in these proteins is rather
limited (table 3.6).


Fig. 8.3. Rates of amino acid substitution in the fibrinopeptides, hemoglobin, and
cytochrome c. Comparisons for which no adequate time coordinate is available are indicated
by numbered crosses. Point 1 represents a date of 1200 ± 75 MY (million years) for the
separation of plants and animals, based on a linear extrapolation of the cytochrome c
curve. Points 2-10 refer to events in the evolution of the globin family. The δ/β separation
is at point 3, γ/β is at 4, and α/β is at 500 MY (carp/lamprey). From Dickerson (1971).


8.3.2 Differences among proteins

Why  is  the  rate  of  amino  acid  substitution  so  much  different  for  different 
proteins?  The  answer  to  this  question  seems  to  be  that  the  functional 
requirement  of  each  protein  determines  the  rate  (Margoliash  and  Smith, 
1965;  Zuckerkandl  and  Pauling,  1965;  King  and  Jukes,  1969;  Dickerson, 
1971).  For  example,  the  fibrinopeptides  have  little  known  function  after  they 
are  cut  out  of  fibrinogen  when  it  is  converted  to  fibrin  for  blood  clotting. 
Thus,  virtually  all  amino  acids  can  be  replaced  by  any  other  amino  acids. 
Namely,  almost  all  mutations  occurring  at  the  cistron  for  the  polypeptides 
seem to be selectively neutral. The rate of amino acid substitutions is therefore
expected to be close to the mutation rate per locus. The apparently
functionless parts of ribonuclease also show a rate of amino acid substitutions
similar to that of fibrinopeptides (Barnard et al., 1972).

On  the  other  hand,  there  is  a strong  functional  requirement  in  the  amino 
acid  sequence  of  cytochrome  c (Dickerson,  1971).  The  polypeptide  of  this 
protein  forms  a shell,  inside  which  the  heme  group  is  contained  with  one 
edge  of  the  heme  being  exposed  outside.  The  interior  amino  acids  are  mostly 
hydrophobic  and  apparently  cannot  be  replaced  by  hydrophilic  amino  acids. 
The  heme  is  attached  covalently  to  the  protein  through  cysteines  at  positions 
14  and  17.  The  amino  acids  at  these  positions  are  the  same  in  all  species. 
Amino  acids  at  the  surface  of  this  protein  are  less  restrictive  but  still  must 
form  a certain  structure  to  interact  with  cytochrome  oxidase  and  reductase, 
both  of  which  are  macromolecules  much  larger  than  cytochrome  c itself. 
This  strong  functional  requirement  rejects  many  mutational  changes  of 
amino  acids  in  this  protein  and  only  at  a limited  number  of  amino  acid  sites 
mutational  changes  are  accepted  freely. 


Table 8.5

Rates of amino acid substitution at the surface and heme pocket regions of the hemoglobin
α- and β-chains (Kimura and Ohta, 1973b).

Region          α-chain             β-chain

Surface         1.4 (illegible)     2.1
Heme pocket     0.17 (19)           0.24 (21)

Note: The rate represents 'per amino acid site per year'. The values in the table should be
multiplied by 10^-9. The figures in brackets are the number of amino acid sites
involved.

A protein of which the functional requirement is intermediate between the
fibrinopeptides and cytochrome c is hemoglobin. This protein also contains
the heme group, and the interior amino acids do not easily accept mutational
changes. In the α-chain there are 19 amino acid sites that are involved in
the so-called heme pocket. Replacement of amino acids at these sites is
known to cause abnormal function of the hemoglobin molecules in man
(Perutz and Lehmann, 1968). The function of hemoglobin is to bind O2
in the lung and interact with CO2 in the tissue, and the surface of the molecule
has no essential function except holding the other important amino acids.
Thus, the amino acids at the surface can easily be replaced by other amino
acids. Kimura and Ohta (1973b) computed the rate of amino acid substitution
at the heme pocket and at the surface separately for the α- and β-chains.
The results obtained (table 8.5) indicate that the rate of amino acid substitution
at the surface is about ten times higher than that at the heme pocket.

The slowest rate of amino acid substitution so far observed is that of
histone IV. There are only two amino acid differences in the sequence of 105
amino acids between calf and pea. If we assume that plants and animals
diverged 1.0 to 1.2 billion years ago (see fig. 8.3), the rate of amino acid
substitution is computed to be roughly $1 \times 10^{-11}$ per site per year. This is
about 1/100 of the rate for hemoglobin chains and about 1/40 of that for
cytochrome c. This extremely slow rate of evolutionary change in histone IV
is believed to be due to the important role this protein plays in controlling
the expression of genetic information by binding DNA in the nucleus.
Similarly slow rates of evolutionary change have been observed also for
transfer and ribosomal RNA (see ch. 2). Since these RNAs play an important
role in protein synthesis, many nucleotide substitutions seem to result in
deleterious effects. Particularly, in the case of transfer RNA nucleotide substitution
seems to be prohibited at the three nucleotides of the codon
recognition region. If one of the three nucleotides is replaced by another, it
could translate a wrong amino acid in all proteins in the organism. This
would bring a disastrous effect in development and physiology of an organism.
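The histone IV rate quoted above can be reproduced directly from the numbers in the text. In the sketch below the function name is an assumption; the factor 2 accounts for the two lines of descent, and no correction for multiple substitutions is applied because the proportion of differences is very small.

```python
def substitution_rate(differences, sites, divergence_years):
    """Amino acid substitutions per site per year along two diverging lineages."""
    proportion_different = differences / sites
    return proportion_different / (2 * divergence_years)


if __name__ == "__main__":
    # Two differences in 105 sites between calf and pea, with the divergence
    # of plants and animals taken as 1.0 to 1.2 billion years ago.
    for years in (1.0e9, 1.1e9, 1.2e9):
        print(years, substitution_rate(2, 105, years))
```

All three divergence times give a rate slightly below $1 \times 10^{-11}$ per site per year, in line with the rounded figure in the text.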

There are several other proteins of which the rates of amino acid substitution
are known, though they are not so reliable as those for cytochrome
c, hemoglobin, and fibrinopeptides. They are given in table 3.5.

8.3.3 Is the rate of amino acid substitution constant in a given protein?

In fig. 8.3 we have seen that the rate of amino acid substitution for a given
protein is roughly constant when time is measured in years. This problem
has been studied in more detail by Ohta and Kimura (1971b). They estimated
the rate of amino acid substitutions ($\lambda$) for hemoglobin α- and β-chains and
cytochrome c in various 'semi-independent' comparisons among different
organisms by using formula (2.3). The variance of the estimates of $\lambda$ for
different comparisons was then compared with the theoretical variance given
by (2.4). The results obtained are given in table 8.6. It is seen that the observed
variance is considerably larger than the theoretical in all polypeptides studied,
the variance ratio (F value) being statistically significant in hemoglobin
β-chain and cytochrome c. This study therefore suggests that the rate of
amino acid substitution per year is not strictly constant.


Table 8.6

Evolutionary rates of hemoglobins and cytochrome c and their standard errors. The
expected standard errors are also given for each comparison. From Ohta and Kimura
(1971b).

Comparison                  Twice divergence time    $\lambda \times 10^9$

Hemoglobin, β-type
  (mean $\lambda \times 10^9$ = 1.526; standard error: observed 0.610*, expected 0.298)
  Spider monkey-Mouse       (illegible)              1.225
  Human-Rabbit              1.6                      (illegible)
  Horse-Bovine fetal        1.0                      (illegible)
  Llama-Bovine              1.0                      1.306
  Human δ-Sheep (A)         1.6                      1.288
  Rhesus monkey-Goat        1.6                      1.184
  Pig-Sheep (C)             1.0                      1.231

Hemoglobin, α-type
  (mean $\lambda \times 10^9$ = 0.973; standard error: observed 0.409, expected 0.299)
  Human-Bovine              1.6                      0.769
  Gorilla-Monkey            0.8                      0.430
  Rabbit-Mouse              1.6                      1.326
  Horse-Sheep               1.0                      1.442
  Pig-Carp                  7.3                      0.877

Cytochrome c
  (mean $\lambda \times 10^9$ = 0.281; standard error: observed 0.208*, expected 0.114)
  Human-Dog                 1.6                      0.699
  Kangaroo-Horse            (illegible)              0.290
  Chicken-Rabbit            6.0                      0.136
  Pig-Graywhale             1.6                      0.121
  Snapping turtle-Pigeon    6.0                      0.136
  Bullfrog-Tuna             7.5                      0.207
  Rattlesnake-Dogfish       7.5                      0.364

* Statistically highly significant by F-test.
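The comparison of observed and theoretical variance can be illustrated with the α-type hemoglobin entries of table 8.6, which are the most legible. The sketch below assumes that the 'observed standard error' is the sample standard deviation of the individual rate estimates (the numbers are roughly consistent with this reading) and that the F value is the squared ratio of observed to expected standard error. The function name is hypothetical.

```python
import statistics

# Rate estimates (x 10^9) for the five alpha-type hemoglobin comparisons,
# as read from table 8.6.
alpha_rates = [0.769, 0.430, 1.326, 1.442, 0.877]


def variance_ratio(observed_se, expected_se):
    """F value: observed variance divided by the theoretical variance."""
    return (observed_se / expected_se) ** 2


mean_rate = statistics.mean(alpha_rates)
observed_sd = statistics.stdev(alpha_rates)  # sample standard deviation (n - 1)
f_alpha = variance_ratio(0.409, 0.299)       # tabulated observed and expected values

if __name__ == "__main__":
    print(mean_rate, observed_sd, f_alpha)
```

The recomputed mean and standard deviation land close to the tabulated 0.973 and 0.409, and the variance ratio for the α-chain exceeds unity although, according to the text, it is not statistically significant.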
Fitch and Margoliash (1967b) and Fitch and Markowitz (1970) studied
the  distribution  of  the  number  of  codon  substitutions  per  site  in  cytochrome 
c.  They  first  constructed  an  evolutionary  tree  from  the  similarity  of  amino 
acid  sequences  in  29  widely  varying  species  from  Neurospora  to  man.  From 
this  phylogenetic  tree,  they  inferred  the  amino  acid  sequences  of  all  the 
common  ancestors  of  these  species  by  using  the  genetic  code.  They  then 
estimated  the  total  number  of  evolutionary  changes  of  codons  at  each  amino 
acid  site.  The  results  obtained  are  given  in  the  row  of  'Observed'  in  table  8.7. 
This  observed  distribution  was  compared  with  three  'model  distributions'. 
Model 1 assumes that all codon sites are equally variable, so that the distribution
becomes the Poisson. Model 2 assumes that there are some invariable
codons  but  the  others  are  equally  variable,  the  variable  part  following  the 
Poisson.  Model  3 assumes  that  there  are  some  invariable  codons  and  that 
there  are  two  classes  of  variable  codons,  i.e.  variable  and  hypervariable.  The 
best-fitting  distribution  for  each  of  the  three  possible  models  is  given  in 
table  8.7  in  comparison  with  the  observed.  It  is  clear  that  only  the  third 
model  gives  a reasonably  good  fit  to  the  data.  In  this  model  the  number  of 
invariable  codons  was  estimated  to  be  32,  the  remaining  81  being  divided 
into  two  groups  of  size  65  and  16.  The  first  of  these  two  groups  had  the 
mean  substitutions  of  3.2  and  the  second  10.1.  Thus,  the  rate  of  codon 
substitutions  is  about  three  times  higher  in  the  hypervariable  group  than  in 
the  variable  group.  Clearly,  this  result  supports  our  earlier  observation  that 
the  functional  requirement  of  this  protein  does  not  allow  all  codons  to  vary 
with  equal  probability. 
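The third model can be made concrete by computing the expected number of codon sites showing a given number of substitutions. The sketch below takes the estimates quoted above (32 invariable codons, 65 variable codons with mean 3.2, and 16 hypervariable codons with mean 10.1) and treats each variable class as Poisson. The function names are assumptions, and the observed distribution of table 8.7 is not reproduced here.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k events under a Poisson distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def expected_sites(k, invariable=32, classes=((65, 3.2), (16, 10.1))):
    """Expected number of codon sites with exactly k substitutions under model 3.

    invariable: number of codons that never change (they contribute only at k = 0).
    classes: (number of codons, mean substitutions) for each variable class.
    """
    total = float(invariable) if k == 0 else 0.0
    for n_codons, mean in classes:
        total += n_codons * poisson_pmf(k, mean)
    return total


if __name__ == "__main__":
    for k in range(0, 16):
        print(k, round(expected_sites(k), 2))
```

Setting `classes=((113, m),)` with `invariable=0` gives the single-Poisson model 1 for any chosen mean `m`, which makes it easy to see why a single Poisson cannot produce both a large zero class and a long tail.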


Table 8.8

Covarions and the rates of amino acid substitutions. From Fitch (1972).

Protein             Codon            Codons    Rate1    Covarions    Rate2
                    substitutions

Cytochrome c         5               104       0.048    10           0.50
α hemoglobin        22               141       0.156    50           0.44
β hemoglobin        31               146       0.212    39           0.80
Fibrinopeptide A    13                19       0.684    18           0.72

Note: 'Codon substitutions' are the number of codon substitutions occurring in the
indicated gene in both lines of descent since the common ancestor of the horse and
the pig. 'Codons' is simply the length of the sequence. 'Rate1' is the rate of
substitution/codon since the divergence of horse and pig. 'Rate2' is the rate of
substitution/covarion.

Fitch and Markowitz conducted a similar statistical analysis for various
groups  of  organisms  and  discovered  an  interesting  property.  Namely,  when 
they  excluded  five  species  of  the  fungus  group  from  the  previous  29  species, 
their  estimate  of  the  proportion  of  invariable  codons  was  about  45  percent. 
When  plant  species  were  excluded,  it  increased  to  about  60  percent.  When 
only  mammalian  species  were  used,  the  proportion  was  even  higher.  They 
noticed  that  the  proportion  of  invariable  codons  is  negatively  proportional 
to  the  range  of  species  used,  i.e.  the  genetic  distance  (number  of  codon 
substitutions)  of  the  most  remotely  related  species  in  the  group  used.  Using 
a linear  extrapolation,  they  then  estimated  the  proportion  of  invariable 
codons  when  only  one  species  is  used.  It  was  about  90  percent.  This  result 
suggests  that  in  any  one  species  only  about  10  percent  of  the  cytochrome  c 
codons,  i.e.,  about  10  codons,  are  subject  to  evolutionary  change  at  any 
moment  in  the  course  of  evolution.  Fitch  and  Markowitz  called  these  codons 
the  concomitantly  variable  codons  or  covarions. 

Fitch (1971b, 1972) showed that the numbers of covarions in hemoglobin
α- and β-chains are also much smaller than the total number of codons.
Table 8.8 shows the estimates of the number of covarions for four polypeptides.
It is seen that the proportion of covarions is higher in hemoglobin
α- and β-chains than in cytochrome c and that in fibrinopeptide A the
covarions include virtually all codons. Thus, the proportion of covarions is
higher in fast evolving proteins than in slowly evolving ones, as expected.
Table 8.8 also includes the rate of codon substitutions per covarion (Rate2).
Interestingly, this rate is roughly the same for all polypeptides, though the
rate per codon (Rate1) varies considerably.
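The two rates in table 8.8 are simple ratios, and recomputing them makes the contrast explicit. The sketch below uses the counts from the table; the dictionary layout and function names are assumptions made for this example.

```python
# (codon substitutions, codons, covarions) for each polypeptide, from table 8.8.
table_8_8 = {
    "Cytochrome c": (5, 104, 10),
    "alpha hemoglobin": (22, 141, 50),
    "beta hemoglobin": (31, 146, 39),
    "Fibrinopeptide A": (13, 19, 18),
}


def rate_per_codon(substitutions, codons):
    """Rate1: codon substitutions divided by sequence length."""
    return substitutions / codons


def rate_per_covarion(substitutions, covarions):
    """Rate2: codon substitutions divided by the number of covarions."""
    return substitutions / covarions


rates = {
    name: (rate_per_codon(s, c), rate_per_covarion(s, v))
    for name, (s, c, v) in table_8_8.items()
}

if __name__ == "__main__":
    for name, (rate1, rate2) in rates.items():
        print(f"{name:18s} Rate1={rate1:.3f} Rate2={rate2:.2f}")
```

Rate1 spans more than a tenfold range across the four polypeptides, whereas Rate2 varies by less than a factor of two. For the β-chain the ratio 31/39 comes to about 0.79, marginally below the tabulated 0.80.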

One  might  wonder  why  the  number  of  variable  codon  sites  increases  as 
the  species  range  is  broadened.  The  reason  seems  to  be  that  there  are  several 
different  groups  of  covarions,  each  species  belonging  to  one  of  them,  and 
the  number  of  different  covarion  groups  included  becomes  large  when  a 
larger  range  of  species  is  used  in  the  analysis.  In  fact,  Fitch  (1971c)  showed 
that the fungi and metazoan (Drosophila, fish, etc.) groups have different
covarions.  Fitch  and  Markowitz  suggest  that  in  a given  species  codon 
substitutions  are  generally  restricted  to  the  covarions,  but  occasionally  they 
induce  a new  group  of  covarions,  destroying  the  original  group.  A possible 
reason  for  this  change  of  covarion  groups  is  that  an  amino  acid  substitution 
at  some  position  starts  to  impose  a restriction  of  amino  acid  substitution  at 
other positions. For example, the three dimensional structures of rat and
bovine ribonucleases (RNases) are well understood. Rat RNase has amino
acids glycine and serine at positions 38 and 39, respectively. Glycine could
mutate to aspartic acid, but this seems to be damaging because it could
interact with lysine at position 41 and pull this necessary residue out of the
active site of this enzyme. Also, serine could mutate to arginine and there is
no reason that this might not be acceptable. In bovine RNase, the groups
are indeed aspartic acid and arginine, but the positively charged arginine
neutralizes the negatively charged aspartic acid and probably prevents any
deleterious effect of the aspartic acid on the critical lysine at 41. If this is
true, the substitution of serine by arginine at 39 must have preceded the
substitution of glycine by aspartic acid at 38. Interestingly, the amino acids
at 38 and 39 in porcine RNase are found to be glycine and arginine, respectively.
This illustrates how the positions of a group of covarions may change:
before the arginine fixation position 38 cannot accept aspartic acid, while
after the arginine fixation the newly fixed aspartic acid cannot be replaced
by a neutral amino acid any more. Fitch and Markowitz provide some more
examples.

The  concept  of  covarions  clearly  indicates  that  the  rate  of  amino  acid 
substitution  is  not  the  same  for  all  sites  and  at  a particular  site  the  rate  may 
change  according  to  what  amino  acids  are  present  at  the  positions  with 
which  it  interacts.  However,  this  concept  itself  is  not  incompatible  with 
the  idea  that  the  rate  of  amino  acid  substitution  is  constant  per  polypeptide, 
since  the  total  probability  of  amino  acid  substitution  per  polypeptide  per 
year may still remain approximately the same. Langley and Fitch (1973, 1974)
tested this hypothesis by using the concept of a Poisson process. Their method
utilizes codon substitution data for several proteins simultaneously, assuming
that the rate of codon substitutions per unit length of time is constant for a
given polypeptide but may vary from polypeptide to polypeptide. The probability
of r codon substitutions during time length t is given by a modification of
formula (2.1), in which nλ is replaced by m_i, the rate for the i-th protein.
Thus, fitting this formula to all branches of the evolutionary trees for
hemoglobin α- and β-chains, cytochrome c, and fibrinopeptide A, they estimated
the relative values of m_i and the relative time lengths of each evolutionary
branch by using the maximum likelihood method. The constancy of m_i was then
tested by examining the deviation of the observed number of amino acid
substitutions from the expected number for each branch. The total χ² value for
the deviations was highly significant, indicating that the rate of amino acid
substitutions is not constant. It is noteworthy that in this test no estimate
of divergence time between two groups of organisms is required, so that it is
free from the error due to dating of fossil records.
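The logic of this test can be illustrated with a minimal sketch. It assumes that formula (2.1) has the usual Poisson form, with mean m_i t for the i-th protein on a branch of relative length t, and it uses a Pearson chi-square sum as the measure of deviation. The rates, branch lengths, and counts below are hypothetical values chosen only for illustration, and the function names are ours. Langley and Fitch estimate m_i and t by maximum likelihood rather than taking them as given.

```python
import math

def poisson_prob(r, rate, t):
    """Probability of r codon substitutions in time t when the expected
    number is rate * t (assumed Poisson form of formula 2.1)."""
    mean = rate * t
    return math.exp(-mean) * mean ** r / math.factorial(r)

def chi_square(observed, rates, times):
    """Pearson chi-square summed over proteins i and branches j, with
    expected count m_i * t_j for each cell."""
    total = 0.0
    for i, m in enumerate(rates):
        for j, t in enumerate(times):
            expected = m * t
            total += (observed[i][j] - expected) ** 2 / expected
    return total

# Hypothetical relative rates for two proteins and two branch lengths.
rates = [2.0, 0.5]
times = [1.0, 4.0]
# Hypothetical observed counts, indexed as observed[protein][branch].
observed = [[3, 6], [1, 2]]

print(poisson_prob(0, rates[0], times[0]))
print(chi_square(observed, rates, times))
```

A large chi-square relative to its degrees of freedom would indicate that the counts are not well described by a single constant rate per protein.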

This  result  is  of  course  expected.  If  a large  amount  of  codon  substitution 
data are used, as in this case, even a small degree of deviation from constancy
would  be  detected.  Strictly  speaking,  if  the  covarions  of  a protein  change 
from  time  to  time,  as  shown  earlier,  the  rate  of  codon  substitutions  should 
not  be  constant  over  all  evolutionary  branches.  Even  if  the  majority  of  codon 
substitutions  are  neutral  with  respect  to  protein  function,  some  mutations 
may occasionally confer a selective advantage on the individuals possessing
them, and codon substitution may then be accelerated. Dickerson (1971)
states that this acceleration of codon substitution would occur particularly
when a new gene has been created from a duplicate gene but is still in the
process of modification. It may also occur when the functional requirement of
a protein
changes.  For  example,  the  high  rate  of  amino  acid  substitution  in  guinea  pig 
insulin  seems  to  be  due  to  the  fact  that  this  protein  has  lost  zinc  constraint 
(Kimura  and  Ohta,  1974). 

We  have  emphasized  the  nonconstancy  of  the  rate  of  codon  substitution 
in  evolution.  However,  we  note  that  the  rate  is  still  roughly  constant  over 
most  of  the  evolutionary  time,  as  we  have  seen  in  fig.  8.3.  Langley  and 
Fitch's (1974) detailed analysis also supports this view. Fig. 8.4 shows the
maximum likelihood estimates of codon substitutions after divergence of
various mammalian groups plotted against geological time estimates. The
dots and '×' marks indicate the points of divergence, the numbers beside
them referring to the nodes of the phylogenetic tree constructed (fig. 8.5).
It is seen that, except in the primate group, the number of codon substitutions
is roughly proportional to the divergence time. In this connection it is
worthwhile to note that the dating of points 1 (divergence between man and
gorilla) and 2 (divergence between apes and gibbons) has recently been
questioned, and the true divergence times may be considerably shorter than
those given in this figure (see sec. 8.4). A similar result has been obtained
with amino acid substitution data in myoglobin (Romero-Herrera et al., 1973).

Fig. 8.4. Maximum likelihood estimates of codon substitutions after divergence of
various mammalian groups plotted against geological time estimates. The dots and '×'
marks indicate the points of divergence, the numbers beside them referring to the nodes
given in the phylogenetic tree in fig. 8.5. The geological time estimates for the '×' points
are somewhat dubious. Also, the divergence times for points 1 and 2 are probably
overestimates. From Langley and Fitch (1974).

… A. The numbers along each leg give the ratio of observed and expected substitutions for
the proteins examined. From Langley and Fitch (1974).


8.4  Phylogenetic  trees 

8.4.1 Codon or nucleotide substitution data


As  we  have  seen  above,  the  rate  of  codon  substitution  seems  to  be  roughly 
constant when time is measured chronologically. This property provides a
useful method of constructing phylogenetic trees of organisms, though there
is always some danger that the tree produced deviates considerably from the
true tree. The general methods of constructing evolutionary trees are
essentially the same as those used in numerical taxonomy, and the principle is
to minimize the deviation of the constructed tree from the observed data
(Fitch and Margoliash, 1967a; Dayhoff, 1969). The trees constructed by
these methods generally agree with those based on fossil records and
morphological differences. When amino acid sequence data are available
for several different proteins in the same group of animals, several
phylogenetic trees can be made for the group. The trees obtained generally have
the same phylogenetic features (Langley and Fitch, 1974). An improved
composite tree can be made by combining all sequence data (Dayhoff, 1969).
One of the best such methods so far available seems to be that of Langley
and Fitch (1973, 1974), the principle of which has already been mentioned
(section 8.3). In this method the effect of random fluctuation inherent in the
process of codon substitution is minimized, since data for several proteins
are used simultaneously.
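The principle of minimizing the deviation of a constructed tree from the observed data can be sketched for the simplest case of three species joined at a single node. The sum of squared relative deviations used here is only one plausible criterion in the spirit of least-squares tree fitting, and is not claimed to be the exact statistic of any of the authors cited. The species labels, distances, and function names are hypothetical.

```python
from itertools import combinations

def three_point_branches(d_ab, d_ac, d_bc):
    """Branch lengths from tips A, B, C to their common node that
    reproduce the three pairwise distances exactly."""
    a = (d_ab + d_ac - d_bc) / 2.0
    b = (d_ab + d_bc - d_ac) / 2.0
    c = (d_ac + d_bc - d_ab) / 2.0
    return {"A": a, "B": b, "C": c}

def deviation(observed, branches):
    """Sum of squared relative deviations between observed pairwise
    distances and the distances implied by the tree."""
    total = 0.0
    for x, y in combinations(sorted(branches), 2):
        d_obs = observed[(x, y)]
        d_tree = branches[x] + branches[y]  # path through the central node
        total += ((d_obs - d_tree) / d_obs) ** 2
    return total

# Hypothetical observed distances (e.g. numbers of codon differences).
observed = {("A", "B"): 10.0, ("A", "C"): 20.0, ("B", "C"): 22.0}

fitted = three_point_branches(10.0, 20.0, 22.0)
guess = {"A": 5.0, "B": 5.0, "C": 10.0}  # an arbitrary poorer tree

print(fitted, deviation(observed, fitted))
print(guess, deviation(observed, guess))
```

With more than three species the pairwise distances can no longer be reproduced exactly in general, and alternative trees are compared by the size of such a deviation.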

The  phylogenetic  tree  for  vertebrate  animals,  produced  by  this  method 
using cytochrome c, hemoglobin α and β, and fibrinopeptide A, is given in
fig.  8.5.  Comparison  of  this  tree  with  the  corresponding  part  of  fig.  2.2 
indicates  that  the  molecular  tree  is  in  good  agreement  with  the  tree  based 
on  geological  data.  We  have  already  mentioned  that  the  relative  evolutionary 
times  of  different  branches  of  the  molecular  tree  also  agree  with  geological 
time  estimates. 

As  mentioned  earlier,  fossil  records  are  missing  or  very  fragmentary  in 
many  groups  of  organisms.  In  these  organisms,  phylogenetic  trees  are  now 
being  constructed  for  the  first  time  by  using  this  technique.  Also,  in  classical 
evolutionary  studies  it  was  difficult  to  construct  a reasonable  evolutionary 
scheme  of  different  phyla.  It  is  expected  that  in  the  near  future  even  this 
problem  will  be  solved  by  the  molecular  approach.  It  is  notable  that 
McLaughlin and Dayhoff (1973) were recently able to construct a phylogenetic
tree for the five kingdoms of organisms, Monera, Protista, Plantae,
Fungi, and Animalia, by using cytochrome c. In ch. 2 I have mentioned that
this method is useful even in uncovering the earliest stage of life by using a
slowly evolving transfer or ribosomal RNA.


8.4.2 Immunological data

It  has  long  been  known  that  immunological  reaction  can  be  used  for  clarifying 
the  genetic  relationship  among  different  species  (Leone,  1964).  Recently  this 
technique  has  been  improved  considerably.  There  are  several  different 
methods,  such  as  quantitative  precipitation,  immunodiffusion,  etc.,  but  the 
simplest and most useful method seems to be that of quantitative
microcomplement fixation of purified albumin, initiated by Sarich and Wilson
(1966). Briefly, the method is as follows. The antisera to be used are produced
by immunization of rabbits with purified serum albumin from an organism
of the group to be tested, say man. The antisera produced react strongly
with human albumin (homologous antigen) but less strongly with that from
another organism (heterologous antigen) for a given concentration of
antisera. If the serum concentration is raised, however, the reaction with
heterologous antigen increases to the level for homologous antigen. The
degree  of  antigenic  difference  between  pairs  of  albumins  is  measured  by  the 
factor  by  which  the  antiserum  concentration  must  be  raised  in  order  for  a 
heterologous  albumin  to  produce  the  same  reaction  as  that  with  a homologous 
albumin. This factor is called the index of dissimilarity (I.D.). The
antigen-antibody reaction is measured by a method called quantitative complement
fixation.  Sarich  and  Wilson  (1967)  showed  that  the  logarithm  of  I.D.,  which 
is  called  the  immunological  distance,  is  approximately  linearly  related  to  the 
time  after  divergence  between  the  two  organisms  tested.  Using  lysozymes 
instead  of  albumin,  Prager  and  Wilson  (1971)  have  shown  that  log  I.D.  is 
linearly  related  to  the  proportion  of  different  amino  acids  between  the  two 
sequences  compared.  The  reason  why  log  I.D.  should  be  a linear  function  of 
the  proportion  of  different  amino  acids  is  not  known.  Furthermore,  whether 
the  same  property  holds  for  albumin  is  not  known.  (Albumin,  consisting  of 
about  500  amino  acids,  is  a much  larger  protein  than  lysozyme,  which  is 
composed  of  about  120  amino  acids,  and  for  measuring  genetic  distance  it 
behaves  much  better.  However,  the  amino  acid  sequence  of  this  protein  is 
poorly known.) Nevertheless, the empirical property of log I.D. is very
useful  for  measuring  genetic  distance  between  species,  since  the  technique  is 
much  simpler  than  amino  acid  sequencing. 
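The use of log I.D. as a measure of divergence can be sketched as follows. Two assumptions are made that the text does not state: the logarithm is taken to base 10, and the approximately linear relation with time is treated as simple proportionality, calibrated on one reference pair of species with a known divergence time. All numerical values and function names are hypothetical.

```python
import math

def immunological_distance(index_of_dissimilarity):
    """Logarithm of the index of dissimilarity (base 10 is assumed)."""
    return math.log10(index_of_dissimilarity)

def divergence_time(distance, ref_distance, ref_time):
    """Divergence time under assumed proportionality between
    immunological distance and time, using one calibration pair."""
    return ref_time * distance / ref_distance

# Hypothetical case: the antiserum concentration must be raised 10-fold
# for the heterologous albumin to react like the homologous one.
d = immunological_distance(10.0)

# Hypothetical calibration pair: distance 2.0 at 60 time units.
print(d, divergence_time(d, ref_distance=2.0, ref_time=60.0))
```

Because the relation is only approximately linear, and fails for large distances, such an estimate should be read as a rough guide.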

Using  this  technique,  Sarich  and  Wilson  and  their  associates  have  obtained 
several  interesting  results.  As  mentioned  earlier,  the  fossil  record  for  human 
evolution  is  quite  fragmentary.  Many  anthropologists  believe  that  the 
human  lineage  was  separated  from  the  African  ape  lineage  at  the  latest  about 
14 million years ago (Uzzell and Pilbeam, 1971). Some claim, however, that
the separation of man from apes was as recent as about 5 million years ago.
Sarich  and  Wilson  (1967)  have  shown  that  the  immunological  data  are 
consistent  with  the  latter  view.  This  view  is  also  supported  by  the  amino  acid 
sequence  data  for  hemoglobins  (Wilson  and  Sarich,  1969).  Of  course, 
Sarich  and  Wilson's  data  can  be  explained  by  Goodman's  (Goodman, 
1963;  Goodman  et  al.,  1974)  view  that  the  rate  of  molecular  evolution  has 
slowed  down  in  the  primate  group,  though  such  a view  has  been  criticized 
by  Sarich  and  Wilson  (1973). 

Another  interesting  result  obtained  using  immunological  techniques  is 
that  a pair  of  species  that  belong  to  the  same  genus  in  frogs  often  have  an 
immunological  distance  as  large  as  that  observed  between  different  families 
or  orders  in  mammals  (Wallace  et  al.,  1971).  For  example,  the  albumin 
immunological  distance  (log  I.D.)  between  Rana  pipiens  (North  American 
frog)  and  R.  corrugata  (Ceylon  frog)  is  1.76,  while  the  distance  between  man 
and carnivore species (Hyaena, Genetta, Ursus, and Arctogalidia) is 1.62
(Sarich and Wilson, 1973). Note that man and carnivores belong to different
orders. Therefore, there seems to be a considerable difference between albumin
evolution  and  morphological  evolution.  The  large  differences  in  albumin 
among  frog  species  can  be  explained  by  the  assumption  that  the  divergence 
of  frog  species  occurred  a long  time  ago  and  albumin  has  undergone  a 
considerable  change,  though  morphological  characters  have  not  changed 
correspondingly. 

The  immunological  technique,  however,  is  not  very  powerful  for  a group 
of  organisms  which  are  related  too  distantly  or  too  closely.  For  example, 
bird  albumins  generally  do  not  react  with  mammalian  antisera.  Also,  if 
log  I.D.  is  larger  than  2,  the  linearity  with  divergence  time  is  destroyed.  The 
immunological  distance  between  a pair  of  mammalian  species  is  generally 
lower  than  2,  but  in  frogs  a pair  of  species  belonging  to  the  same  family 
often  shows  a distance  larger  than  2.  In  this  case  amino  acid  sequence  data 
are  much  more  reliable.  On  the  other  hand,  if  the  species  compared  are  too 
closely  related,  the  technique  is  again  unreliable,  since  it  depends  on  the 
measurement  of  a single  protein.  In  this  case  the  electrophoretic  method 
mentioned  in  ch.  7 seems  to  be  more  reliable. 

8.4.3 Phylogenies of homologous proteins

Fig. 8.6. Evolution of the genes for the human globins. Insufficient evidence is available
to place the fetal ε and ζ genes on the tree with certainty; however, the ε-chain appears to
be most similar to the β-chain and the ζ-chain to the γ-chain. From Dayhoff et al. (1972b).

In section 8.4.1, we used amino acid sequence data mainly for constructing
a phylogenetic tree of a group of organisms. However, they can also be used
for making a phylogenetic tree for a group of related proteins. As mentioned
earlier, myoglobin and all hemoglobin genes have apparently evolved from a
single common ancestral gene. Since the rates of amino acid substitution in
these globin polypeptides are roughly the same, approximate evolutionary
times of the globins can be estimated. This sort of phylogenetic tree is very
useful in understanding the evolution of protein functions.
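The estimation of evolutionary times from a roughly constant substitution rate can be sketched as follows. The sketch assumes the usual Poisson correction, in which the expected number of substitutions per site between two sequences is d = -ln(1 - p) for a proportion p of differing sites, and a divergence time t = d / (2λ), since substitutions accumulate along both lineages. The proportion and the rate λ below are hypothetical, not values taken from the globin data.

```python
import math

def substitutions_per_site(p_diff):
    """Poisson-corrected number of substitutions per site between two
    sequences that differ at a proportion p_diff of their sites."""
    return -math.log(1.0 - p_diff)

def divergence_time(p_diff, rate_per_site_per_year):
    """Time since two sequences diverged, assuming the same constant
    rate along both lineages (hence the factor of 2)."""
    return substitutions_per_site(p_diff) / (2.0 * rate_per_site_per_year)

# Hypothetical example: 40 percent of sites differ and the rate is
# 1e-9 substitutions per site per year.
print(substitutions_per_site(0.4))
print(divergence_time(0.4, 1e-9))
```

The correction matters most for distantly related chains, where repeated substitutions at the same site make the observed proportion of differences an underestimate.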

The  phylogenetic  tree  for  the  globins  is  given  in  fig.  8.6.  It  is  clear  that  the 
separation  of  hemoglobin  and  myoglobin  occurred  by  gene  duplication 
about  1100  million  years  ago,  long  before  the  evolution  of  vertebrates.  The 
first hemoglobin-like protein appears to have been a monomer with a
molecular weight of about 17,000. A single-chain globin still exists in a lower
vertebrate, the lamprey. The next step of globin evolution was the gene
duplication which produced two different chains, α and β. The mutual
adaptation of these two chains resulted in the formation of the tetramer
hemoglobin, consisting of two α-chains and two β-chains. This form of
hemoglobin now exists in all species of mammals. Later, the β-chain gene
was duplicated and the gene for the γ-chain was produced. The human γ-chains
are synthesized in the fetus, while the β-chains occur in children and adults.
Rather early in primate evolution, the β-chain gene was again duplicated,
producing a new gene for the δ-chain. Most primates seem to have this
chain, though the rhesus monkey does not (Boyer et al., 1971). In man both
β- and δ-chains are found in adults in tetramer forms with the α-chain,
α₂β₂ and α₂δ₂. The proportion of α₂δ₂ is generally small and varies with
the individual. It seems that the γ-chain gene was also duplicated just before
the splitting of the human and chimpanzee lineages. Man and chimpanzee
both have the same two nonallelic γ-chains, which differ only in one amino
acid position. Furthermore, there seem to be two identical genes for the
human α-chain, suggesting another gene duplication in very recent times. In
addition to the above hemoglobin chains, there are two other functional
hemoglobin chains, ε and ζ, in the human fetus. Unfortunately, however, the
amino  acid  sequences  of  these  chains  have  not  yet  been  determined. 

The  above  example  of  globin  evolution  illustrates  how  the  evolutionary 
pathways  of  a group  of  proteins  or  polypeptide  chains  can  be  reconstructed 
by  studying  amino  acid  sequences.  As  mentioned  earlier  (table  8.2),  there 
are  many  groups  of  proteins  in  which  the  sequences  are  closely  related.  At 
the  present  time,  the  sequences  of  these  proteins  are  known  only  for  a small 
number  of  species.  In  the  future,  however,  more  sequence  data  will  be 
available,  and  the  evolutionary  schemes  of  these  proteins  will  eventually  be 
elucidated.  If  this  is  done  for  many  different  groups  of  proteins,  we  will 
be  able  to  understand  what  kind  of  genetic  change  was  important  for  the 
evolution of a particular group of organisms or of a particular
morphological or physiological character. The antigen-antibody reaction in
vertebrates is one of the most complex physiological systems in biology. There
are  many  different  proteins  (immunoglobulins)  involved  in  this  system 
(section  6.3).  Amino  acid  sequence  data  of  these  immunoglobulins  suggest 
that  all  of  them  have  evolved  from  a single  ancestral  gene.  For  the  present 
inference  of  the  evolutionary  scheme  of  this  group  of  proteins,  the  reader 
may  refer  to  an  excellent  review  by  Barker  et  al.  (1972). 


8.5 Adaptive and nonadaptive evolution

8.5.1 Mechanisms of molecular evolution

In  the  foregoing  sections  we  have  discussed  various  aspects  of  evolutionary 
change  of  macromolecules.  Let  us  now  consider  the  underlying  mechanisms 
of  molecular  evolution. 

Following  Kimura  and  Ohta  (1974),  we  can  summarize  the  observations 
about  molecular  evolution  as  follows: 

1)  For  each  informational  macromolecule  the  rate  of  evolution  in  terms 
of  amino  acid  (or  nucleotide)  substitution  is  approximately  constant  per 
year  per  site  for  various  evolutionary  lines,  as  long  as  the  function  of  the 
molecule  remains  the  same. 

2)  Functionally  less  important  molecules  or  parts  of  molecules  evolve 
faster  than  more  important  ones. 

3)  Amino  acid  (or  nucleotide)  substitutions  that  impair  the  function  of  a 
molecule  occur  less  frequently  than  those  maintaining  the  same  function. 

4)  Gene  duplication  generally  precedes  the  emergence  of  a gene  having  a 
new  function. 

Virtually  all  of  the  above  features  of  molecular  evolution  were  uncovered 
as  soon  as  Zuckerkandl  and  Pauling  (1965)  and  Margoliash  and  Smith  (1965) 
started extensive studies of evolutionary change of macromolecules. They
tried to explain these observations in terms of neo-Darwinism, though they
realized that they were discovering new aspects of evolution. For example,
Margoliash and Smith thought that the constant rate of amino acid
substitution per site per year is possible if various types of selection are averaged
out.  For  these  biochemists  or  even  eminent  evolutionists  such  as  Simpson 
(1964)  and  Mayr  (1965),  it  was  unthinkable  at  that  time  that  a mutant  gene 
is  ever  fixed  in  a large  population  without  the  aid  of  natural  selection. 

A careful examination of the above features of molecular evolution, however,
indicates that they contradict most of the principles of neo-Darwinism
mentioned  in  the  Introduction  of  this  book.  In  neo-Darwinism  the  rate 
of  evolution  should  depend  on  how  often  and  how  fast  the  environment 
changes.  Thus,  it  would  be  expected  that  the  rate  of  evolution  in  living  fossils 
such  as  the  lamprey  is  much  slower  than  that  of  rapidly  evolved  groups  such 
as primates. In practice, however, the hemoglobin of the lamprey has diverged
just  as  far  from  myoglobin  as  have  the  hemoglobins  of  mammals,  as  was 
pointed  out  by  Jukes  (1971).  According  to  neo-Darwinism,  the  rate  of 
evolution  should  also  depend  on  generation  time  rather  than  chronological 
time (chs. 4 and 5). As we have already discussed, this prediction does not
hold for molecular evolution. Clearly, molecular evolution does not obey
the principles of neo-Darwinism. On the contrary, as emphasized by Kimura
(1969b), the constant rate of molecular evolution is most easily explained
by assuming that a majority of amino acid (or nucleotide) substitutions occur
by random fixation of neutral or nearly neutral mutations. In ch. 5, we have
seen  that  the  rate  of  gene  substitution  for  neutral  genes  is  equal  to  the 
mutation  rate  irrespective  of  population  size. 
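The independence of population size can be seen from a short calculation. The sketch below assumes a diploid population of N individuals, a neutral mutation rate v per gene per generation, and a fixation probability of 1/(2N) for each new neutral mutant. These are the standard assumptions behind the result, although the exact notation of ch. 5 is not reproduced here and the function name is ours.

```python
import math

def neutral_substitution_rate(pop_size, mutation_rate):
    """Rate of gene substitution for neutral mutations in a diploid
    population: (new mutants per generation) x (fixation probability)."""
    new_mutants = 2 * pop_size * mutation_rate   # 2N genes, each mutating at rate v
    fixation_prob = 1.0 / (2 * pop_size)         # each neutral mutant starts at frequency 1/(2N)
    return new_mutants * fixation_prob

v = 1e-7  # hypothetical neutral mutation rate per gene per generation
for n in (100, 10_000, 1_000_000):
    print(n, neutral_substitution_rate(n, v))
```

The factor 2N cancels, so the substitution rate equals v whatever the population size.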

In  neo-Darwinism  natural  selection  is  the  most  important  factor  in 
evolution,  and  virtually  every  character  of  an  organism  is  regarded  as  a 
product  of  natural  selection.  Thus,  Simpson  (1964)  states  that  'natural 
selection  is  the  composer  of  the  genetic  message,  and  DNA,  RNA,  enzymes, 
and  other  molecules  are  successively  its  messengers'.  This  view  was  challenged 
by  King  and  Jukes  (1969),  who  state:  'Evolutionary  change  is  not  imposed 
upon  DNA  from  without;  it  arises  from  within.  Natural  selection  is  the 
editor,  rather  than  the  composer,  of  the  genetic  message.  One  thing  the  editor 
does  not  do  is  to  remove  changes  which  it  is  unable  to  perceive'.  Ohno 
(1970,  1972)  has  pushed  this  idea  further.  He  states  that  at  the  molecular 
level  the  main  role  of  natural  selection  is  to  conserve  the  already  established 
function  of  a molecule  and  protect  it  from  destructive  mutations.  Here, 
natural selection plays only a negative role, not a constructive one.

From  the  review  in  the  foregoing  sections,  it  is  abundantly  clear  that 
mutation  plays  an  important  role  in  molecular  evolution.  Genes  of  new 
function  are  created  by  mutation  from  duplicate  genes.  If  there  are  many 
redundant  genes,  they  would  mutate  freely  without  being  eliminated  by 
natural  selection.  In  a majority  of  cases  such  mutations  will  be  destructive, 
but  once  in  a while  they  may  produce  a gene  of  new  function.  Of  course,  at 
the  early  stage  of  evolution  of  a new  gene  natural  selection  would  play  a 
constructive  role,  sieving  'good'  mutations  which  increase  the  fitness  of 
individuals.  However,  once  a gene  establishes  its  own  function,  natural 
selection  appears  to  operate  mainly  just  to  keep  it  clean.  Mutations  that 
do  not  impair  function  may  be  fixed  in  the  population  by  genetic  drift. 
Therefore,  the  rate  of  evolution  is  determined  by  the  rate  of  neutral  or  nearly 
neutral  mutations.  If  the  mutation  rate  is  constant  per  year,  then  the  rate  of 
gene  substitution  per  year  will  be  constant. 

It  seems  therefore  clear  that  the  observations  about  molecular  evolution 
are better explained by the neutral mutation hypothesis (Kimura, 1968a;
King  and  Jukes,  1969),  though  the  number  of  proteins  studied  is  still  small. 
Immediately  after  this  hypothesis  was  proposed,  it  was  criticized  by  a 
number of authors. Most of the criticisms, however, seem to be based
on a misunderstanding of the hypothesis (see Kimura and Ohta, 1972b).
For  example,  showing  that  chemically  similar  amino  acid  substitutions 
occur  more  frequently  than  dissimilar  ones,  Clarke  (1970)  took  it  as  evidence 
against  the  neutral  mutation  hypothesis.  As  pointed  out  by  Jukes  and 
King  (1971),  however,  this  observation  is  more  consistent  with  the  neutral 
mutation  theory,  in  which  deleterious  mutations  are  expected  to  occur. 
Nevertheless,  we  must  keep  in  mind  that  this  hypothesis  is  again  the 
majority  rule  and  does  not  prohibit  exceptions.  Indeed  there  must  always 
be  a certain  number  of  adaptive  gene  substitutions  when  a population  is 
adapting  to  a new  environment.  However,  such  gene  substitutions  appear 
to be a minority of the total gene substitutions that are taking place
simultaneously. Note that in a randomly mating population 30 to 50 percent of
loci  are  polymorphic  and  a polymorphic  locus  often  has  more  than  two 
alleles.  Even  if  90  percent  of  mutant  alleles  are  neutral,  there  are  still  a large 
number  of  alleles  which  may  be  used  for  adaptive  evolution. 

In  ch.  5 we  have  emphasized  that  the  definition  of  neutral  genes  depends 
on population size and in small populations slightly advantageous or
disadvantageous mutations may behave just like neutral genes. If we note that
disadvantageous mutations are probably much more frequent than
advantageous mutations, it is expected that a considerable number of slightly
deleterious  mutations  are  fixed  in  the  population  (Mayo,  1970;  Ohta  and 
Kimura,  1971b).  Ohta  (1972b,  1973)  regards  this  as  one  of  the  important 
aspects  of  molecular  evolution.  According  to  her,  slightly  disadvantageous 
mutations  are  fixed  in  the  population  more  often  than  advantageous 
mutations.  Fixation  of  disadvantageous  mutations  will  of  course  result  in  a 
reduction in fitness, but it will be recovered by occasional fixation of
advantageous genes. She believes that this provides an explanation of Fitch's
concept of unstable covarions. Namely, if a mutation disturbs the function
of a molecule very slightly, there may arise many possible ways of
compensating the effect of the mutation, thus opening a possibility of change of
covarions.  The  small  but  significant  variation  in  the  rate  of  amino  acid 
substitution  discussed  earlier  may  also  be  due  to  the  alternate  fixation  of 
slightly  disadvantageous  mutations  and  advantageous  mutations.  If  this  is 
the  case,  Romero-Herrera  et  al.’s  (1973)  observation  that  the  rate  of  amino 
acid  substitution  is  roughly  constant  on  the  long-term  basis  but  varies 
considerably  on  the  short-term  basis  is  no  longer  mysterious.  Furthermore, 
Ohta's  hypothesis  can  be  used  to  explain  the  interspecific  variation  in  function 
of  cytochrome  c and  hemoglobins.  Although  cytochromes  c from  virtually 
all organisms are interchangeable in in vitro tests with substrates, there is
variation  in  ion-binding  properties  (Margoliash  et  al.,  1970).  Hemoglobins 
from  different  primate  species  also  show  a variation  in  oxygen-binding 
properties  (Sullivan,  1972).  In  these  cases,  however,  nothing  is  known  about 
the  relationship  between  the  interspecific  variation  and  fitness. 

Ohta  further  predicts  that  the  rate  of  evolution  is  more  rapid  in  small 
populations  than  in  large  populations.  This  prediction  is  based  on  her  view 
that the selection coefficient of a mutant gene is variable both spatially and
temporally  because  of  environmental  variation.  Thus,  in  a large  population 
which  occupies  a large  territory  an  advantageous  mutation  must  be  beneficial 
in  many  different  environmental  conditions.  On  the  other  hand,  in  a small 
population  environmental  variation  is  likely  to  be  small,  so  that  a mutant 
gene would be advantageous more often than in a large population.
Furthermore, in a small population even slightly deleterious genes may be fixed.
Thus,  the  rate  of  gene  substitution  is  likely  to  be  higher  in  small  populations 
than  in  large  populations.  This  view  is  in  contrast  with  Wright's  (1931,  1932, 
1956,  1970)  balance-shift  theory  of  evolution,  in  which  a large  population 
subdivided  into  many  local  demes  provides  the  most  favorable  condition  for 
evolution.  In  the  case  of  nonadaptive  evolution,  it  is  probable  that  more  gene 
substitutions  occur  in  small  populations  than  in  large  populations.  In  the 
foregoing  chapter  we  have  also  seen  the  possibility  that  speciation  occurs 
more quickly in small populations. With respect to adaptive evolution,
however, we do not know which of the two hypotheses is correct, though there
is  some  paleontological  evidence  that  rapid  evolution  often  occurs  in  small 
populations  (Simpson,  1953).  We  note,  however,  as  Ohta  did,  that  small 
populations  are  expected  to  have  a much  higher  chance  of  extinction  than 
large  populations. 
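The part of this argument that concerns slightly deleterious genes can be illustrated numerically. The sketch below uses Kimura's diffusion approximation for the fixation probability of a single new mutant with selective advantage s (negative for a deleterious mutant), assuming that the effective population size equals the actual size N. It does not model the spatial and temporal variation of selection coefficients that Ohta invokes, and the parameter values are hypothetical.

```python
import math

def fixation_probability(pop_size, s):
    """Diffusion approximation for a single new mutant of initial
    frequency 1/(2N); reduces to 1/(2N) for a neutral mutant."""
    if s == 0:
        return 1.0 / (2 * pop_size)
    return math.expm1(-2.0 * s) / math.expm1(-4.0 * pop_size * s)

def substitution_rate(pop_size, s, mutation_rate):
    """Substitutions per generation: 2N new mutants per unit mutation
    rate, each fixed with the probability given above."""
    return 2 * pop_size * mutation_rate * fixation_probability(pop_size, s)

v = 1e-6    # hypothetical mutation rate per gene per generation
s = -1e-4   # hypothetical slight disadvantage
for n in (100, 10_000, 100_000):
    print(n, substitution_rate(n, s, v) / v)
```

For these values the slightly deleterious mutant is substituted at nearly the neutral rate when N = 100, but almost never when N = 100,000.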

At  any  rate,  data  on  molecular  evolution  are  explained  more  easily  by  the 
neutral  mutation  theory  than  by  neo-Darwinism.  It  should,  however,  be 
remembered  that  this  theory  is  heavily  dependent  on  the  assumption  that  the 
rate  of  neutral  or  near-neutral  mutations  is  constant  per  year  rather  than  per 
generation.  If  this  assumption  is  not  correct,  the  neutral  mutation  theory 
will  be  seriously  impaired.  In  ch.  3 we  presented  some  evidence  to  support 
this  assumption,  but  the  rate  of  neutral  mutations  is  largely  unknown.  It  is 
therefore  an  urgent  need  to  test  the  constancy  of  the  rate  of  neutral  mutations 
by  using  a variety  of  organisms. 

8.5.2 Polymorphism as a phase of evolution

In  neo-Darwinism  the  genetic  variation  within  a population  is  regarded  as  a 
storage  from  which  the  variation  required  for  future  evolution  may  be  drawn. 
This  storage  is  supposed  to  contain  almost  any  kind  of  genetic  variation, 
so  that  the  population  can  adapt  to  any  environmental  change.  At  the 
molecular  level,  however,  this  view  is  not  supported  at  all,  since  the  genetic 
variation  within  populations  is  quite  different  in  different  species.  Even  at  the 
level  of  electrophoretically  detectable  proteins,  two  closely  related  species 
often  have  different  alleles  (ch.  7).  The  proportion  of  common  polymorphic 
alleles  between  two  different  genera  is  negligibly  small.  Clearly,  the  genetic 
variation  at  the  molecular  level  is  not  the  same  for  all  species  but  reflects 
its  own  evolutionary  history.  It  is  a product  of  evolution  rather  than  the 
storage  designed  for  future  use. 

At  the  molecular  level  polymorphism  within  populations  may  also  be 
regarded  as  a phase  of  evolution,  as  emphasized  by  Kimura  and  Ohta 
(1971a).  Namely,  a majority  of  polymorphisms  must  be  transient.  In  fact, 
the  level  of  average  heterozygosity  for  protein  loci  in  outbreeding  organisms 
roughly  agrees  with  the  value  expected  from  the  rate  of  gene  substitution 
(ch.  6).  Earlier,  we  have  noted  the  difficulty  in  distinguishing  between  different 
mechanisms of maintenance of polymorphism from the study of gene
frequencies in natural populations. However, since molecular evolution
strongly  supports  the  neutral  mutation  theory  and  the  observed  level  of 
average  heterozygosity  agrees  with  the  expected  value,  it  is  likely  that  the 
majority  of  protein  polymorphism  in  the  present  natural  populations  is  also 
due  to  neutral  or  nearly  neutral  mutations.  Transient  polymorphism  may 
also  be  caused  by  advantageous  genes,  but  the  contribution  of  these  genes  to 
polymorphism  is  apparently  very  small  (ch.  6). 
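
The agreement between observed and expected heterozygosity mentioned above can be illustrated numerically. The following is a minimal sketch, assuming the infinite-allele model of neutral mutations, under which the expected heterozygosity at equilibrium is H = 4Nv/(1 + 4Nv), where N is the effective population size and v is the neutral mutation rate per locus per generation. The function name and the parameter values are illustrative assumptions and are not taken from the text.

```python
def expected_heterozygosity(n_e: float, v: float) -> float:
    """Equilibrium heterozygosity under the infinite-allele neutral model.

    n_e: effective population size (assumed constant over time).
    v:   neutral mutation rate per locus per generation.
    """
    if n_e <= 0 or v < 0:
        raise ValueError("n_e must be positive and v must be non-negative")
    m = 4.0 * n_e * v          # the composite parameter 4Nv
    return m / (1.0 + m)       # H = 4Nv / (1 + 4Nv)


if __name__ == "__main__":
    # Hypothetical values chosen only to show how H depends on 4Nv.
    for n_e in (1e4, 1e5, 1e6):
        h = expected_heterozygosity(n_e, 1e-7)
        print(f"N = {n_e:.0e}, v = 1e-7, H = {h:.4f}")
```

Since H depends on N and v only through the product 4Nv, a given level of heterozygosity does not by itself separate population size from mutation rate.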

In  ch.  6 I have  indicated  that  protein  polymorphism  due  to  balancing 
selection  may  be  detected  by  examining  the  amino  acid  sequence  of 
homologous  proteins  in  many  different  organisms,  since  such  a polymorphism 
should  persist  for  a long  time.  Many  organisms  show  polymorphism  for 
hemoglobin  and  fibrinopeptide  (Dayhoff,  1972),  but  none  of  them  are 
polymorphic  for  the  same  pairs  of  alleles  or  same  pairs  of  codons.  This 
indicates  that  a polymorphism  for  a particular  set  of  alleles  cannot  persist 
for  a long  time.  This  would  reflect  either  the  rarity  or  temporariness  of 
balancing  selection.  If  this  is  the  case,  balancing  selection  cannot  contribute 
to  polymorphism  very  much.  Note  that  even  a neutral  allele  may  persist 
for  a surprisingly  long  time  - often  longer  than  the  lifetime  of  the  species  (ch.  5). 


As  mentioned  earlier  (ch.  4),  the  ABO,  MN,  and  Lewis  blood  group  loci 
in  man  and  some  primates  seem  to  be  polymorphic  for  the  same  or  similar 
alleles.  However,  the  biochemical  relationship  between  blood  group 
phenotypes  and  their  genes  is  poorly  understood  at  the  present  time,  so  that  it  is 
not  certain  whether  the  alleles  A,  B,  O,  etc.,  in  man  are  the  same  as  those  in 
orangutan  at  the  codon  level. 

8.5.3  Molecular  evolution  and  morphological  change 

Although  the  main  purpose  of  this  book  is  to  discuss  molecular  variation 
and  evolution,  it  seems  appropriate  to  consider  briefly  the  implications  of 
molecular  evolution  for  morphological  or  physiological  change.  At  the 
present  time  it  is  widely  accepted  that  evolution  of  morphological  or 
physiological  characters  occurs  following  the  principles  of  neo-Darwinian  evolution 
(ch.  1).  Some  extreme  neo-Darwinian  evolutionists  maintain  the  view  that 
all  these  characters  are  the  product  of  natural  selection  and  every  genetic 
variation  in  them  has  some  adaptive  significance.  In  this  view  the  role  of 
genetic  drift  in  evolution  is  virtually  neglected  (Ford,  1964). 

There  is  a large  amount  of  data  to  support  neo-Darwinian  evolution  with 
respect  to  major  aspects  of  morphological  evolution.  In  the  evolution  of 
these  characters  generally  several  or  many  gene  loci  are  concerned.  If  there 
are  enough  favorable  mutations  in  a population,  it  is  not  impossible  to 
produce  a genotype  that  is  adapted  to  a particular  environment  without  the 
aid  of  natural  selection.  In  the  absence  of  natural  selection,  however,  the 
probability  of  fixation  of  such  a genotype  in  the  population  is  extremely 
small.  That  is,  evolution  without  natural  selection  is  very  slow.  On  the 
other  hand,  if  natural  selection  operates,  the  frequencies  of  favorable  genes 
rapidly  increase,  and  with  the  aid  of  recombination  the  favorable 
genes  in  different  individuals  are  easily  combined  into  single  individuals 
which  will  then  have  a further  increased  fitness.  Therefore,  natural  selection 
speeds  up  evolution  tremendously.  There  is  no  question  that  natural  selection 
played  an  important  role  in  the  evolution  of  many  intricate  characters  of 
higher  organisms.  This  is  particularly  so  when  a character  is  controlled  by  a 
series  of  interacting  gene  loci. 
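
The contrast between fixation with and without selection can be sketched for a single locus. The code below assumes Kimura's (1962) diffusion formula for a single new mutant in a population of N diploid individuals, with effective size equal to N and genic selection: u = (1 - e^{-2s})/(1 - e^{-4Ns}), which reduces to 1/(2N) when s = 0. The function name and the parameter values are illustrative assumptions; the multi-locus genotypes discussed above are not modeled.

```python
import math


def fixation_probability(n: int, s: float) -> float:
    """Probability that a single new mutant gene is eventually fixed.

    n: number of diploid individuals (effective size assumed equal to n).
    s: selective advantage of the mutant gene (genic selection assumed).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if abs(4.0 * n * s) < 1e-9:
        # Neutral limit: the initial frequency of the mutant, 1/(2N).
        return 1.0 / (2.0 * n)
    exponent = -4.0 * n * s
    if exponent > 700.0:
        # Strongly deleterious mutant: the probability underflows to zero.
        return 0.0
    # (1 - e^{-2s}) / (1 - e^{-4Ns}), written with expm1 for accuracy.
    return math.expm1(-2.0 * s) / math.expm1(exponent)


if __name__ == "__main__":
    # Hypothetical values: N = 10,000 and a 1 percent advantage.
    n = 10_000
    neutral = fixation_probability(n, 0.0)
    favored = fixation_probability(n, 0.01)
    print(f"neutral: {neutral:.2e}")
    print(f"advantageous: {favored:.2e}")
    print(f"ratio: {favored / neutral:.0f}")
```

With these assumed values the advantageous mutant is fixed with a probability of roughly 2s, several hundred times that of a neutral mutant, which is the sense in which selection speeds up fixation.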

Nevertheless,  the  relationship  between  a morphological  character  and 
fitness  in  a given  environment  is  often  obscure.  In  general,  a considerable 
amount  of  variation  in  a quantitative  character  seems  to  be  tolerated  by  the 
environment  in  which  the  organism  lives.  For  example,  the  variations  in 
stature  and  weight  in  human  adults  are  not  directly  related  to  fitness,  except 
for  extreme  individuals  at  both  ends.  Clayton  and  Robertson  (1955)  and 
Robertson  (1967)  have  shown  that  the  genetic  variation  in  bristle  number  of 
Drosophila  melanogaster  is  apparently  largely  neutral.  Thus,  even 
morphological  characters  may  be  subject  to  change  due  to  genetic  drift.  That  is, 
at  least  some  of  the  morphological  differences  between  species  must  be  due 
to  random  fixation  of  genes  (Wright,  1932). 

We  know  that  the  so-called  living  fossils  such  as  horseshoe  crabs  and 
lampreys  have  maintained  the  same  morphological  characters  for  a long  time. 
The  usual  explanation  for  this  is  that  these  organisms  are  so  well  adapted 
to  a particular  continuously  available  environment  that  almost  any  mutation 
occurring  in  them  is  disadvantageous  (Simpson,  1953).  This  seems  to  be 
true  at  the  morphological  level.  At  the  gene  level,  however,  it  is  likely  that 
as  long  as  new  mutations  do  not  change  the  morphology  drastically  they 
may  be  incorporated  into  the  genome,  so  that  genes  are  constantly  changing 
even  at  loci  which  control  morphology.  The  extensive  protein  polymorphisms 
discovered  in  the  horseshoe  crab  (Selander  et  al.,  1970)  and  Lycopodium 
(Levin  and  Crepet,  1973)  seem  to  support  this  view,  though  the  relationship 
between  protein  polymorphism  and  morphological  variation  has  not  been 
clarified. 

In  neo-Darwinism  mutation  plays  a minor  role  in  determining  the  rate  of 
evolution.  It  is  assumed  that  since  mutation  occurs  recurrently  most  natural 
populations  contain  enough  genetic  variability  and  thus  the  rate  of  evolution 
is  determined  mainly  by  the  change  of  environment  and  natural  selection 
(ch.  1).  At  the  molecular  level,  however,  this  assumption  cannot  be  justified. 
Clearly,  mutations  are  mostly  unique  and  do  not  recur  (ch.  3).  This  would 
be  particularly  so  for  advantageous  mutations,  since  the  frequency  of  these 
mutations  must  be  very  small.  We  would  then  expect  that  the  rate  of  adaptive 
evolution  is  controlled  not  only  by  natural  selection  but  also  by  mutation 
rate.  If  a population  is  not  equipped  with  favorable  mutations  when  a drastic 
environmental  change  occurs,  it  would  simply  become  extinct  or  remain  unadapted 
until  new  mutations  occur.  It  is  possible  that  a large  proportion  of  extinct 
species  in  the  past  lacked  such  favorable  mutations  to  cope  with 
environmental  changes.  It  is  then  not  surprising  that  more  than  99  percent  of  the 
species  in  the  past  have  become  extinct.  At  any  rate,  mutation  seems  to  be 
very  important  even  in  adaptive  evolution. 

In  the  early  20th  century  De  Vries  and  his  followers  maintained  the  theory 
that  evolution  occurs  mostly  by  mutations  with  large  phenotypic  effects. 
They  thought  that  the  effect  of  natural  selection  is  too  small  to  transform  a 
species  into  another.  The  large-effect  mutations  with  which  this  school  was 
concerned  later  proved  to  be  rare  or  of  no  evolutionary  consequence.  Also, 
in  this  theory  little  attention  was  paid  to  the  fact  that  evolution  occurs 
through  genetic  change  of  populations  rather  than  individuals.  Realization 
of  these  deficiencies  in  mutationism  resulted  in  the  rise  of  neo-Darwinism 
or  the  synthetic  theory  of  evolution,  and  by  1950  mutationism  was  in  full 
retreat.  As  a consequence,  the  view  that  mutation  is  the  main  factor  of 
evolution  has  been  completely  rejected.  As  was  recently  emphasized  by 
Kimura  and  Ohta  (1974),  however,  neo-Darwinism  should  be  reexamined. 
Although  the  mutation  we  see  now  is  different  from  that  of  De  Vries  and 
generally  minute  in  effect,  it  seems  to  be  the  primary  factor  of  evolution 
at  both  the  molecular  and  morphological  levels. 